Next Generation XAFS Data Library

This page is a Work-in-Progress Proposal for how to build a Next Generation XAFS Data Library The main idea is to allow XAFS Data Library that

  1. facilitate the exchange XAFS spectra, particularly on model compounds, such as found in Model Compound Libraries.
  2. store and manage multiple spectra, but still easily export to plain ASCII files.
  3. allow users to have public and private libraries of spectra.
  4. make it easy to share sets of XAFS spectra.

To date, there have been several1, 2, 3, 4 Web-based Databases. While these all organize XAFS spectra on model compounds, they all have shortcomings such as incomplete data, difficulty of adding new data, and incompatible formats.

The aim here is to create a way to store multiple XAFS spectra in a manner that can be used within dedicated applications and embedded into existing data processing software with minimal effort. This will also alleviate many of the problems associated with the current databases.

Motivation

Storing and Exchanging XAFS Data is a common need for everyone using XAFS. In particular, retrieving data on Standards or Model Compounds is a continuing need for both XANES and EXAFS analysis. Additionally, data taken on samples at different facilities and beamlines need to be compared and analyzed together. At this point, there is no commonly accepted data format for XAFS data. There have been a few attempts 5 to standardize the traditional "ASCII Column File". To be sure, ASCII Column Files have some strong appeal. Most data collection software saves such files, and most processing and analysis software use some variation of this format. In addition, such files may be easily read by humans and used in a wide variety of third party applications. Still, most ASCII Column Files need some intimate knowledge of the data layout to use the data. The lack of a standard ASCII format is a serious problem. The work here is parallel to the efforts to come up with a standard. and will be able to convert data into such standard file formats. .

For the effort here, in which storing multiple spectra is a key requirement, ASCII Column Data Files offer no standard way to hold multiple spectra. In order to facilitate the sharing data between beamlines and the exchange (and recognition) of high-quality data on standards, it is useful to have Data Libraries or Repositories which a scientist can use at will and exchange with others. A key point is that not only should data be well-formatted and vetted for quality, but should also be easily be viewed as a part of Suite of Spectra.

A Data Library is often envisioned as a single centralized Library (see http://xafs.org/Databases) which is meant to house Standard Data with some assurance of quality and the expectation that the data is to be shared with the entire community. In contrast, many researchers keep their own set of data closely guarded and do not want to share their data. Another model for a Data Library is a set of spectra being shared between collaborators, but then possibly allowed to have a wider distribution (say, after some paper has been published). I am considering all of these as legitimate use cases, and am more interested here in creating a format that can be used for all these cases.

Implementation Overview

With the motivation from the previous section, the goals for the format and library are:

  1. store spectra for exchange, especially for model compounds. Raw data, direct from the beamline will probably need to be converted to this format.
  2. store information about the sample, measurement conditions, etc.
  3. store *multiple* spectra, either on the same sample or multiple samples, and possibly taken at many facilities.
  4. provide programming libraries and simple standalone applications that can read, write, and manage such data libraries. Programming libraries would have to support multiple languages.

There are a few reasonable ways to solve this problem. What follows below is a methods which makes heavy use of relational databases and SQL. The principle argument here is that relational databases offer a well-understood, proven way to store data with extensible meta-data. The use of SQL also makes the programming libraries simpler, as they can rely on tested SQL syntax to access the underlying data store.

Development Code and Data

As the XAS Data Library is being developed, code and examples will be available at http://github.com/newville/xasdatalib

Why SQLite

I propose using SQLite, a widely used, Free relational database engine as the primary store for the XAFS Data Library. A key feature of SQLite is that it needs no external server or configuration -- the database is contained in a single disk file. SQLite databases can accessed with a variety of tools6, 7.

Challenges Using SQL Tables for Numerical Array

SQL-based relational databases may not be the most obvious choice for storing scientific data composing of arrays of related data. One obvious limitation is that relational databases don't store array data very well. Thus storing array data in a portable way within the confines of an SQL database needs special attention. The approach adopted here is to JSON, which can encapsulate an array, or other complex data structure into a string.

Using JSON to store array data

JSON -- Javascript Object Notation, provides a standard, easy-to-use method for encapsulating complex data structures into strings that can be parsed and used by a large number of programming languages as the original data. In this respect, the requirements for the XAS Data Library -- numerical arrays of data -- are fairly modest. Storing array data in strings is, of course, what ASCII Column Files have done for years, only not with the benefit of a standard programming interface to read them. As an example, an array of data [8000, 8001.0 , 8002.0] would be encoded in json as

'[8000, 8001.0, 8002.0]'

This is considerably easier and lighter weight than using XML to encode array data.

In addition to encoding numerical arrays, JSON can also encode an associative array (also known as a Hash Table, Dictionary, Record, or Key/Value List. This can be a very useful construct for storing attribute information. It might be tempting to use such Associative Arrays for many pieces of data inside the database, this would prevent those data from being used in SQL SELECT and other statements: such data would not be available for making relations. But, as Associative Arrays can so useful and extensible, several of the tables in the database include a attributes column that is always stored as text. This data will be expected to hold a JSON-encoded Associative Array that may be useful to complement the corresponding notes column. This data cannot be used directly in searching the database, but may be useful to particular applications.

Other Challenges When Using SQLite

While robust, powerful and compliant with SQL standards, SQLite does not always provide as rich set of Data Types as some SQL relational databases. In particular for the design here, SQLite does not support Boolean values or Enum fields. Integer Values are used in place of Boolean Values. Enum values (which may have been used to encode Elements, Collection Modes, etc) are implemented as indexes into foreign tables, and JOINs must be used to relate the data in the tables.

Tables and Database Schema

The principle kind of data held in a XAFS Data Library is XAFS Spectra. Additionally, several other kinds of data can be useful to include, such as for sample preparation, measurement conditions, and so on. Of course, a key feature of a multi-spectral database is to be able to combine several spectra into a Suite of Spectra, and also to identify the people adding data to a library. Thus the XAFS Data Library contains the following main tables:

Table Name

Description

spectra

main XAS spectra, pointers to other tables

sample

Samples

crystal_structure

Crystal structures

person

People

citation

Literature or Other Citations

format

Data Formats

suite

Spectra Suites

facility

Facilities

beamline

Beamlines

monochromator

Monochromators

mode

Modes of Data Collection

ligand

Ligands

element

names of Elements

edge

names of x-ray Edges

energy_units

units for energies stored for a spectra

info

General Information, version etc

While some of these tables (spectra, sample) are fairly complex, many of the tables are really quite simple, holding a few pieces of information.

In addition there are a few Join Tables to tie together information and allow Many-to-One and Many-to-Many relations. These tables include

Table Name

Description

spectra_mode

mode(s) used for a particular spectra

spectra_ligand

ligand(s) present in a particular spectra

spectra_suite

spectra contained in a suite

spectra_rating

People's comments and scores for Spectra

suite_rating

People's comments and scores for Suites

A key feature of this layout is that a Suite is very light-weight, simply comprising lists of spectra. Multiple suites can contain an individual spectra, and a particular spectra can be contained in multiple suites without repeated data.

The tables are described in more detail below. While many are straightforward, a few tables may need further explanation.

Spectra Table

This is the principle table for the entire database, and needs extensive discussion. Several of the thorniest issues have to be dealt with in this table, making this likely to be the place where most attention and discussion should probably be focused.

   1 --
   2 create table spectra (
   3         id integer primary key
   4         name text not null, 
   5         notes text, 
   6         attributes text, 
   7         file_link text, 
   8         data_energy text, 
   9         data_i0 text default '[1.0]', 
  10         data_itrans text default '[1.0]', 
  11         data_iemit text default '[1.0]', 
  12         data_irefer text default '[1.0]', 
  13         data_dtime_corr text default '[1.0]', 
  14         calc_mu_trans text default '-log(itrans/i0)', 
  15         calc_mu_emit text default '(iemit*dtime_corr/i0)', 
  16         calc_mu_refer text default '-log(irefer/itrans)', 
  17         notes_i0 text, 
  18         notes_itrans text, 
  19         notes_iemit text, 
  20         notes_irefer text, 
  21         temperature text, 
  22         submission_date datetime, 
  23         collection_date datetime, 
  24         reference_used integer, 
  25         energy_units_id   -- > energy_units table
  26         monochromator_id  -- > monochromator table
  27         person_id         -- > person table
  28         edge_id           -- > edge table
  29         element_z         -- > element table
  30         sample_id         -- > sample table
  31         beamline_id       -- > beamline table
  32         format_id         -- > format table
  33         citation_id       -- > citatione table
  34         reference_id      -- > sample table (for sample used as reference
  35 );

We'll discuss the table entries more by grouping several of them together. First, Each entry in the spectra table contains links to many other tables.

spectra Column Name

Description

energy_units_id

index of energy_units table

person_id

index of person table for person donating spectra

edge_id

index of edge table for X-ray Edge

element_z

index of element table for absorbing element

sample_id

index of sample table, describing the sample

reference_id

index of sample table, describing the reference sample

beamline_id

index of the beamline where data was collected

monochromator_id

index of the monochromator table for mono used

format_id

index of the format table for data format used

citation_id

index of the citation table for literature citation

Next, the table contains ancillary information (you may ask why some of these are explicit while others are allowed to be put in the attributes field).

spectra Column Name

Description

notes

any notes on data

attributes

JSON-encoded hash table of extra attributes

temperature

Sample temperature during measurement

submission_date

date of submission

reference_used

Boolean (0=False, 1=True) of whether a Reference was used

file_link

link to external file

Here, reference_used means whether data was also measured in the reference channel for additional energy calibration . If 1 (True), the reference sample must be given. The file_link entry would be the file and path name for an external file. This must be relative to the directory containing database file itself, and cannot be an absolute path. It may be possible to include URLs, ....

Finally, we have the information for internally stored data arrays themselves

spectra Column Name

Description

Default

data_energy

JSON data for energy

--

data_i0

JSON data for I_0 (Monitor)

1.0

data_itrans

JSON data for I_transmission (I_1)

1.0

data_iemit

JSON data for I_emisssion (fluorescence, electron yield)

1.0

data_irefer

JSON data for I_trans for reference channel

1.0

data_dtime_corr

JSON data for Multiplicative Deadtime Correction for I_emit

1.0

calc_mu_trans

calculation for mu_transmission

-log(dat_itrans/dat_i0)

calc_mu_emit

calculation for mu_emission

dat_iemit * dat_dtime_corr / dat_i0

calc_mu_refer

calculation for mu_reference

-log(dat_irefer/dat_itrans)

calc_energy_ev

calculation to convert energy to eV

None

notes_energy

notes on energy

notes_i0

notes on dat_i0

notes_itrans

notes on dat_itrans

notes_iemit

notes on dat_iemit

notes_irefer

notes on dat_irefer

The data_***** entries will be JSON encoded strings of the array data. The calculations will be covered in more detail below. Note that the spectra_mode table below will be used to determine in which modes the data is recorded.

Data Storage

As alluded to above, the data_***** will be stored as JSON-encoded strings.

==== Encoding Calculations ====

The calculations of mu in the various modes (transmission, fluorescence) are generally well defined, but it is possible to override them by explicitly documenting the expression used to calculate mu. Note that this expression should not be expected to be fully and correctly evaluated by the database -- it is meant for human reading.

Energy Units

There are several distinct conventions for storing the data for energy values from monochromators in the XAFS community. As most beamlines select energy with a double-crystal monochromator, many beamlines store "raw motor positions" or steps for the monochromator angle. Some beamlines store the angle in degrees, and ome store energy in eV or keV. If the details of the monochromator crystal (d-spacing and motor steps per angular unit) are known, then all of these data can be inter-converted.

A possible advantage of saving angular information instead of energy in eV is that energy re-calibration may be easier in some cases. This generally assumes that re-calibration would involve changing an angular offset or monochromator d-spacing, each of which is difficult to tell from a single spectra without additional information about the system. In any event, saving all the "meta data" about the monochromator (d spacing used to convert angle to energy, and possibly steps per angular unit) is recommended.

To use XAFS data, one generally wants the energy units to be eV or keV. Thus, in all cases, knowing how to convert the measured "energy" data, whether it be in steps, degrees, or keV into eV is necessary to use the data.

The energy_units table simply holds conventional names for energy units:

   1 create table energy_units (
   2         id INTEGER NOT NULL, 
   3         units TEXT NOT NULL, 
   4         notes TEXT, 
   5         attributes TEXT, 
   6         PRIMARY KEY (id), 
   7         UNIQUE (units)
   8 );

and will typically be initialized to have these entries:

energy_units units

Notes

eV

electron Volts

keV

kilo-electron Volts

degrees

angle in degrees for a Bragg monochromator. Need mono dspacing!!

step

angular steps for a Bragg monochromator. Need mono dspacing and steps_per_degree!!

more entries can be added as needed.

The monochromator table will hold the details of d-spacing and steps per degree for any particular monochromator.

Since each spectra has an entry for which energy_units and monochromator are used, the data in a Spectra's energy column should be convertible to eV.

Sample Table

The Sample table holds basic information about the sample measured. Each Spectra has a pointer to a Sample entry, and of course one Sample may be measured several times.

   1 -- sample information
   2 create table sample (
   3         id integer not null,  -- primary key
   4         name text not null,  -- must be unique!
   5         notes text, 
   6         attributes text, 
   7         formula text, 
   8         material_source text, 
   9         person_id integer,   --> person table
  10         crystal_structure_id integer,  --> crystal_structure table
  11 );

Note that the material_source and formula information here are plain text. There are also pointers to the Person providing (or documenting) the Sample, and possibly to the Crystal Structure.

Crystal_Structure Table

The Crystal Structure table is meant to hold crystallographic information about a Sample. The idea is that the contents of a file describing the crystal structure (CIFS, PDB format, atoms.inp for example) can be held here. These are plain text values, meant for human reading.

   1 -- crystal information (example format = CIFS , PDB, atoms.inp)
   2 create table crystal (
   3   id          integer primary key ,  -- primary key
   4   format      text not null,
   5   data        text not null,
   6   notes       text,
   7   attributes  text);

Ligand Table

The Ligand table holds names of ligands that might be present in a Spectra.

   1 create table ligand (
   2    id integer primary key, 
   3    name text,
   4    notes text);
   5 
   6 create table spectra_ligand (
   7   id       integer primary key, 
   8   ligand   integer not null,     --> ligand table
   9   spectra  integer not null);    --> spectra table

Person Table

The Person table holds names and email addresses of people who enter Spectra and Samples, and who make Suites and enter ratings.

   1 create table person (
   2   id           integer primary key , 
   3   email        text not null unique,
   4   first_name   text not null,
   5   last_name    text not null,
   6   sha_password text not null);

Citation Table

   1 create table citation (
   2   id           integer primary key , 
   3   journal      text,
   4   authors      text,
   5   title        text,
   6   volume       text,
   7   pages        text,
   8   year         text,
   9   notes        text,
  10   attributes   text,  
  11   doi          text);

Format Table

   1 -- spectra format: table of data formats
   2 --
   3 --  name='internal-json' means data is stored as json data in spectra table
   4 -- 
   5 create table format (
   6   id          integer primary key, 
   7   name        text, 
   8   notes       text,
   9   attributes  text);
  10 
  11 insert into format (name, notes) values ('internal-json', 'Read dat_*** columns of spectra table as json');

Suite Table

   1 --  Suite:  collection of spectra
   2 create table suite (
   3   id          integer primary key , 
   4   person      integer not null,     -- > person table
   5   name        text not null,
   6   notes       text,
   7   attributes  text);
   8 
   9 -- SUITE_SPECTRA: Join table for suite and spectra
  10 create table spectra_suite (
  11   id       integer primary key , 
  12   suite    integer  not null,     -- > suite table
  13   spectra  integer  not null);    -- > spectra table

Rating Table

A rating is a numerical score given to a Spectra or a Suite of Spectra by a particular person. Each score can also be accompanied by a comment.

While not enforced within the database itself, the scoring convention should be Amazon Scoring: a scale of 1 to 5, with 5 being best.

   1 create table rating (
   2   id         integer primary key , 
   3   person     integer  not null,    -- > person table
   4   spectra    integer,              -- > spectra table
   5   suite      integer,              -- > suite table
   6   score      integer,
   7   comments   text);

Monochromator and Mode Tables

These two tables simply list standard monochromator types and data collection modes.

   1 -- Monochromator descriptions
   2 create table monochomator (
   3    id integer primary key, 
   4    name             text, 
   5    lattice_constant text, 
   6    steps_per_degree text, 
   7    notes            text,
   8    attributes       text);
   9 
  10 -- XAS collection modes ('transmission', 'fluorescence', ...)
  11 create table mode (
  12    id  integer primary key, 
  13    name text, 
  14    notes text);
  15 insert into  mode (name, notes) values ('transmission', 
  16                                      'transmission intensity through sample');
  17 insert into  mode (name, notes) values ('fluorescence, total yield', 
  18                                 'total x-ray fluorescence intensity, as measured with ion chamber');
  19 insert into  mode (name, notes) values ('fluorescence, energy analyzed', 
  20                                  'x-ray fluorescence measured with an energy dispersive (solid-state) detector.  These measurements will often need to be corrected for dead-time effects');
  21 insert into  mode (name, notes) values ('electron emission', 'emitted electrons from sample');
  22 insert into  mode (name, notes) values ('xeol', 'visible or uv light emission');
  23 
  24 
  25 create table spectra_modes (
  26   id       integer primary key , 
  27   mode     integer  not null,   -- > collection_mode 
  28   spectra  integer  not null);  -- > spectra table

Beamline and Facility Tables

These two tables list X-ray (synchrotron) facilities and particular beamlines.

   1 -- beamline description 
   2 --    must have a facility
   3 --    a single, physical beamline can be represented many times for different configurations
   4 create table beamline (
   5   id            integer primary key ,  
   6   facility      integer  not null,    --> facility table
   7   name          text, 
   8   xray_source   text, 
   9   monochromator integer,   -- > monochromator table (optional)
  10   notes         text,
  11   attributes    text);
  12 
  13 -- facilities
  14 create table facility (
  15   id integer primary key, 
  16   name         text not null unique, 
  17   notes        text,
  18   attributes   text);

Note that a monochromator is optional for a beamline.

Element and Edge Tables

These two tables simply list standard symbols and names of the elements of the periodic table, and the standard names for the x-ray absorption edges. The schema are

   1 create table element (z integer primary key, 
   2                       symbol text not null unique, 
   3                       name text);
   4 insert into  element (z, symbol, name) values (1, 'H', 'hydrogen');
   5 insert into  element (z, symbol, name) values (2, 'He', 'helium');
   6  
   7 
   8 create table edge (id integer primary key, 
   9                    name text not null unique, 
  10                    level text);
  11 insert into  edge (name,  level) values ('K', '1s');
  12 insert into  edge (name,  level) values ('L3', '2p3/2');
  13 insert into  edge (name,  level) values ('L2', '2p1/2');
  14 insert into  edge (name,  level) values ('L1', '2s');

Supported Low-Level Data Formats

Initially, the principle data format for the XAS Data library will be Internally Stored, JSON-encoded data arrays. Storing data internally has the advantage of preserving the database as a single file. JSON-encoded arrays have the advantage of being readily useful to many languages and environments. Alternate internal formats could be allowed, but no such formats are yet identified.

External data

Example Queries

Programming Interface(s)

Proposed Format for a single data set from Bruce Ravel and Ken McIvor

Talks from January, 2010 Workshop on HDF5 for Synchrotron Data

Upcoming Workshop (April 2011) on Improving Data for XAFS

Notes

XasDataLibrary (last edited 2011-03-18 15:43:03 by MattNewville)