Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement ORSO filewriter #27

Merged
merged 20 commits into from
Feb 7, 2024
Merged

Reimplement ORSO filewriter #27

merged 20 commits into from
Feb 7, 2024

Conversation

jl-wynen
Copy link
Member

@jl-wynen jl-wynen commented Feb 2, 2024

Fixes #6 Most of #6 except tracking corrections.

This is still missing saving the actual data, tracking 'corrections', and filling in the remaining 'instrument' data. But it implements the basic mechanism and this part is ready for review.

For reference, adding it to the default pipeline like this:

from essreflectometry import orso as ess_orso
from essreflectometry.amor import orso as amor_orso
from essreflectometry.amor import providers, default_parameters
from essreflectometry.orso import providers as orso_providers
from essreflectometry.types import *
from orsopy import fileio

providers = (
    *providers,
    *amor_orso.providers,
    *orso_providers,
)

params = {
    **default_parameters,
    Filename[Sample]: "sample.nxs",
    Filename[Reference]: "reference.nxs",
    ess_orso.OrsoCreator: ess_orso.OrsoCreator(
        fileio.base.Person(
            name='Jan-Lukas',
            affiliation='ESS',
        )
    ),
}
pipeline = sciline.Pipeline(providers, params=params)

print(pipeline.compute((ess_orso.OrsoDataSource, ess_orso.OrsoReduction)))

results in

DataSource(
           owner=Person(name='J. Stahn', affiliation=None, contact='[email protected]'),
           experiment=Experiment(title='commissioning', instrument='AMOR', start_date=datetime.datetime(2020, 11, 25, 16, 3, 10), probe='neutron', facility='SINQ'),
           sample=Sample(name=None),
           measurement=Measurement(
                                   instrument_settings=None,
                                   data_files=[File(file='sample.nxs')],
                                   additional_files=[File(file='reference.nxs', comment='supermirror')],
                                   ),
           )
Reduction(
          software=Software(name='ess.reflectometry', version='23.11.1.dev18+gd1cfaa6.d20240124', platform='Linux'),
          timestamp=datetime.datetime(2024, 2, 2, 14, 58, 14, 919380, tzinfo=datetime.timezone.utc),
          creator=Person(name='Jan-Lukas', affiliation='ESS'),
          corrections=[],
          )

With this graph:
Screenshot_20240202_155917

Comment on lines +67 to 68
events = dg['instrument']['multiblade_detector']['data'].copy(deep=False)
events.bins.coords['tof'] = events.bins.coords.pop('event_time_offset')
Copy link
Member

@SimonHeybrock SimonHeybrock Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the copy is meant to prevent in-place modification, note that is does not affect the bin content, i.e., the line second line modifies the buffer of the original. See scipp/scipp#2773.

Comment on lines 58 to 60
# TODO populate timestamp
# doesn't work with a local file because we need the timestamp of the original,
# SciCat can provide that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference to the experiment time above? Isn't the time in NXentry actually a measurement time, not an experiment time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference to the experiment time above?

There is only one experiment time (sample run) but we have multiple files that each have a creation time.

Isn't the time in NXentry actually a measurement time, not an experiment time?

What is the difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know how they define it, but an experiment could consist of multiple measurements (e.g., many samples).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to extend how we handle this here. E.g., use the max of all times in sample files.

We can change this here to use the time recorded in the NXentry instead of the creation time of the file. These are not the same but, as long as there is only one entry, they are essentially equivalent.

"""Parse ORSO sample data from raw Amor NeXus data."""
if not raw_data.get('sample'):
return OrsoSample(data_source.Sample.empty())
raise NotImplementedError('Amor NsXus sample parsing is not implemented')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise NotImplementedError('Amor NsXus sample parsing is not implemented')
raise NotImplementedError('Amor NeXus sample parsing is not implemented')

orso.data_source.measurement.scheme = 'angle- and energy-dispersive'
orso.reduction.software = fileio.reduction.Software(
'scipp-ess', __version__, platform.platform()
"""ORSO utilities for Amor."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear why all or most of the below is in the Amor submodule. It looks pretty generic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It parses stuff from the loaded NeXus. I don't feel confident enough to know what an ESTIA NeXus file looks like to put this into a common module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is part of the NeXus standard, so generally all those fields should exist.

from ..types import Filename, RawData, Reference, Run, Sample


def parse_orso_experiment(raw_data: RawData[Run]) -> OrsoExperiment[Run]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are concerned about keeping alive RawData longer than necessary, consider extracting meta data first, or loading neutron data and meta data independently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider extracting meta data first,

How would that work?

loading neutron data and meta data independently.

I considered it. But that would either mean loading the whole file twice which is slow. Or writing two more complicated loaders.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is quite simple to write such a loader. We can chat tomorrow.

Comment on lines 16 to 37
class OrsoExperiment(
sciline.Scope[Run, data_source.Experiment], data_source.Experiment
):
"""ORSO experiment for a run."""


class OrsoInstrument(
sciline.Scope[Run, data_source.InstrumentSettings], data_source.InstrumentSettings
):
"""ORSO instrument settings for a run."""


class OrsoOwner(sciline.Scope[Run, orso_base.Person], orso_base.Person):
"""ORSO owner of a file."""


class OrsoReduction(sciline.Scope[Run, reduction.Reduction], reduction.Reduction):
"""ORSO measurement for a run."""


class OrsoSample(sciline.Scope[Run, data_source.Sample], data_source.Sample):
"""ORSO sample of a run."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support all this for the Reference measurement? Or is it recording only for the sample run (i.e., can we use NewType instead of a "generic"?

Copy link
Member Author

@jl-wynen jl-wynen Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on whether we want to process a file in isolation and store it as ort. Or if the rest of the pipeline needs to use the metadata. See, e.g., #27 (comment) Though that would change if we implement scipp/scippneutron#473. With this, the ORSO stuff could be more specialised.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I still do not see the case for parametrizing all of the above with the run type. I feel it would be simpler and clear to just use the sample everywhere.

  • The sample and reference are part of the same experiment.
  • The same instrument is used, it is always a reference measurement at that instrument.
  • No one cares about the owner of the reference file, I'd say. I think typically it is done by the same user during their experiment.
  • There is just one reduction, not a reduction for the sample and one for the reference (or if there is, the reference run would be the sample run?).

Comment on lines 82 to 89
# We simply assume that the owner of the reference measurement
# has no claim on this data.
if (sample_experiment.facility != reference_experiment.facility) or (
sample_experiment.instrument != reference_experiment.instrument
):
raise ValueError(
'The sample and reference experiments were done at different instruments'
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like an odd place to check this. Should this be done as part of the "data" path in the task graph?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. But it should still be checked here, no? Otherwise the metadata providers depend on using specific data providers.

def _ascii_unit(unit: sc.Unit) -> str:
unit = str(unit)
if unit == 'Å':
return 'angstrom'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha no love for the Swedish A here I see... But I think it's good to use the international version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORSO requires ASCII strings for units. But nowadays, utf-8 should be fine

Copy link
Contributor

@jokasimr jokasimr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I guess the big difficulty left here is how to specify what corrections were applied without inspecting the TaskGraph. Maybe the user has to manually enter those to be able to build the Orso file?

I think it is difficult to get this 'perfect' now without any reflectometry user input. So in my opinion this can be merged as is. But I would still like to hear what you think about adding file hashes or some other identifier that is more robust than the file name.

qz = sc.midpoints(qz)
r = sc.values(iofq.data)
sr = sc.stddevs(iofq.data)
if sigma_q is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two cases here? Wouldn't it be easier to just either get sigma_Q from the data or from an extra argument to this provider?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be. And I'm going to write a provider that creates an OrsoDataset which will then request all it needs.

return OrsoMeasurement(
data_source.Measurement(
instrument_settings=instrument,
data_files=[orso_base.File(file=os.path.basename(sample_filename))],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a limitation of the Orso standard, so probably little we can do about it. But, identifying a file by the file name seems an unreliable way to achieve reproducibility.
Best way would of course be to store the file itself, second best would be to store a hash and preferably also instructions on how to access the file.

Do you think there is anything we can do about that? Or do we just follow the standard for now? I looked quickly at the orso standard spec but could not find any metadata field where that kind of extra information could be stored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORSO is not like NeXus where you can just write whatever you want. So we are limited by the standard. The only thing we can do (in the short term) is to store any additional info in the comment field. As I understand, this is a free form text filed. But it's meant to be read by humans, not programs...

If you think this is important enough (and I think it is), consider raising it with ORSO: https://github.com/reflectivity/file_format

Copy link
Contributor

@jokasimr jokasimr Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a temporary solution we could just append
sha256:fff4b52fd585a4ca3e9709a64da93209f32db841a07ff7a1cde12db2f8bfe0a3 (or maybe md5 but it probably doesn't matter) to the end of the comment.

I'd happily raise this with ORSO and ask if they can add an optional hash field to the File object.

Copy link
Contributor

@jokasimr jokasimr Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue: reflectivity/file_format#15

data_files=[orso_base.File(file=os.path.basename(sample_filename))],
additional_files=[
orso_base.File(
file=os.path.basename(reference_filename), comment='supermirror'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so you can add comments. Then we can add a file hash and maybe even instructions for how to access the file.

@jl-wynen
Copy link
Member Author

jl-wynen commented Feb 6, 2024

ORSO also allows specifying the machine that the software runs on and the old code used this: https://github.com/scipp/ess/blob/5372ef6e91f2ee8bf4903c3db081b949e76952fd/src/ess/amor/orso.py#L47 I omitted this here because storing a user computer's name doesn't help much and seems like exposing private information. What do you think @jokasimr ?

@jokasimr
Copy link
Contributor

jokasimr commented Feb 6, 2024

ORSO also allows specifying the machine that the software runs on and the old code used this: https://github.com/scipp/ess/blob/5372ef6e91f2ee8bf4903c3db081b949e76952fd/src/ess/amor/orso.py#L47 I omitted this here because storing a user computer's name doesn't help much and seems like exposing private information. What do you think @jokasimr ?

Yes I noticed that as well. It seems unnecessary.

@jl-wynen jl-wynen marked this pull request as ready for review February 6, 2024 14:06
@jl-wynen
Copy link
Member Author

jl-wynen commented Feb 6, 2024

It is now complete except for listing corrections.

We still need to figure out how do do that. The approach with tags that I outlined in the issue #6 will stop working when scipp/sciline#116 is merged.

Copy link
Contributor

@jokasimr jokasimr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@jl-wynen
Copy link
Member Author

jl-wynen commented Feb 6, 2024

I forgot to add tests. Did that now

return OrsoExperiment(
data_source.Experiment(
title=raw_data['title'],
instrument=raw_data['name'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@jl-wynen jl-wynen Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we do need to make this Amor-specific because the files don't follow the standard. (#27 (comment))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in our test file, the name shows up in both places

data_source.Experiment(
title=raw_data['title'],
instrument=raw_data['name'],
facility=raw_data['facility'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not in the NeXus standard. Should we hardcode it to ESS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to use get. We should only hard code anything in an instrument-specific module. With Amor, ESS would be wrong.

Comment on lines +91 to +87
# TODO populate timestamp
# doesn't work with a local file because we need the timestamp of the original,
# SciCat can provide that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use the NXentry/start_time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Because the timestamp is the last modification time. I.e., earliest the end time of the experiment, not the start time. And possibly later if the file was modified for some reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then use the end time? Modifying Raw files is not allowed, to my knowledge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no end time in our Amor file. I could add something now that gets the time if it exists and falls back to None otherwise. But I would rather wait for a more general mechanism in scippneutron that could also handle scicat timestamps and select the most appropriate one.

Comment on lines 16 to 37
class OrsoExperiment(
sciline.Scope[Run, data_source.Experiment], data_source.Experiment
):
"""ORSO experiment for a run."""


class OrsoInstrument(
sciline.Scope[Run, data_source.InstrumentSettings], data_source.InstrumentSettings
):
"""ORSO instrument settings for a run."""


class OrsoOwner(sciline.Scope[Run, orso_base.Person], orso_base.Person):
"""ORSO owner of a file."""


class OrsoReduction(sciline.Scope[Run, reduction.Reduction], reduction.Reduction):
"""ORSO measurement for a run."""


class OrsoSample(sciline.Scope[Run, data_source.Sample], data_source.Sample):
"""ORSO sample of a run."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I still do not see the case for parametrizing all of the above with the run type. I feel it would be simpler and clear to just use the sample everywhere.

  • The sample and reference are part of the same experiment.
  • The same instrument is used, it is always a reference measurement at that instrument.
  • No one cares about the owner of the reference file, I'd say. I think typically it is done by the same user during their experiment.
  • There is just one reduction, not a reduction for the sample and one for the reference (or if there is, the reference run would be the sample run?).

Comment on lines 137 to 138
sample_experiment: Optional[OrsoExperiment[Sample]],
reference_experiment: Optional[OrsoExperiment[Reference]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, I don't think the reference measurement is or should be an "experiment", i.e., I'd remove this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you use the opposite definition of 'experiment' and 'measurement' as ORSO. An experiment is one run at an instrument, i.e., sample and reference are separate experiments. And a measurement is a combination of any number of experiments.

But yes, we could ignore the reference experiment. build_orso_measurement already only includes the sample instrument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a reference? Google found this: https://www.reflectometry.org/projects/file_formats/dictionaries/, and it seems to be consistent with "my" (or rather NeXus') definition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual specification is here: https://github.com/reflectivity/file_format/blob/master/specification.md

But you are right. 'Experiment' refers to a 'series of measurements'. However, the 'measurement' key also refers to a series.

orso.data_source.measurement.scheme = 'angle- and energy-dispersive'
orso.reduction.software = fileio.reduction.Software(
'scipp-ess', __version__, platform.platform()
wavelength = events_in_wavelength.coords['wavelength']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that there are wavelength coords, which may not be the case if we do not start with a TOF coords, or do not have wavelength masks. If not found, fall back to events_in_wavelength.bins.coords['wavelength'].

Comment on lines 25 to 34
return OrsoInstrument(
orso_data_source.InstrumentSettings(
wavelength=orso_base.ValueRange(
min=float(wavelength.min().value),
max=float(wavelength.max().value),
unit=_ascii_unit(wavelength.unit),
),
incident_angle=orso_base.ValueRange(
min=float(incident_angle.min().value),
max=float(incident_angle.max().value),
unit=_ascii_unit(incident_angle.unit),
),
polarization=None, # TODO how can we determine this from the inputs?
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part here is Amor specific?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Knowing which inputs to use. This depends on the specific pipeline.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, build_orso_measurement is also specific to the pipeline because it sets the reference file as 'supermirror'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying the other ESS instruments will operate differently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. Our OFFSPEC workflow normalises by direct beam: https://scipp.github.io/ess/instruments/external/offspec/offspec_reduction.html#Normalisation-of-sample-by-direct-beam Maybe there will be differences between instruments or even operating modes (collimated vs divergent).

)


def build_orso_iofq_dataset(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part here is Amor specific?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Knowing which columns to save. I guess that is not so much amor-specific but workflow-specific. Do we have an established pattern for handling this? Should the providers simply not be part of a providers tuple?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any reason to believe the workflows at ESS will be different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how many different types of workflows we will have. Will the only thing we ever save to ORSO be I(Q)?

@jl-wynen jl-wynen merged commit d566404 into main Feb 7, 2024
3 checks passed
@jl-wynen jl-wynen deleted the orso-filewriter branch February 7, 2024 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-implement Orso file writing
3 participants