Working With Custom Datasets

SASNets is designed to be relatively easy to extend to new kinds of data. This page describes how to use other SANS and SAXS datasets, as well as data from entirely different domains.

SASModels

SASNets uses data generated by a custom fork of SASModels, the package that provides the scattering models used inside SASView. To generate your own SANS data, run

./gen_models.sh <models> <number> <dim> <npoints> <cutoff> <precision>

where:

  • models is the model, or class of models, to run.
  • number is the number of data sets to generate per model.
  • dim is the dimensionality of the data (1D or 2D).
  • npoints is the number of Q points in each data set.
  • cutoff is the polydispersity cutoff, or ‘mono’ for monodisperse data.
  • precision is the floating-point precision (e.g. single or double).

More information on these parameters can be found at http://sasview.org/docs.

The gen_models.sh script is short:

#!/bin/bash
sasview=( ../sasview/build/lib.* )
sep=$(python -c "import os;print(os.pathsep)")
PYTHONPATH=../bumps${sep}../periodictable${sep}$sasview
export PYTHONPATH
PYOPENCL_COMPILER_OUTPUT=1; export PYOPENCL_COMPILER_OUTPUT
PYOPENCL_CTX=2; export PYOPENCL_CTX
python -m sasmodels.generate_sets "$@"

The script first sets the locations of SASView, bumps, and periodictable, which are needed to generate model data (lines 2 through 5). It then enables PyOpenCL compiler output, which prints warnings when models are compiled for the CPU. Finally, PYOPENCL_CTX selects the OpenCL device to compute on; its value is an index into the device list offered by

pyopencl.create_some_context()

The exact index can be determined by running this command in a Python shell: the Intel CPU is typically 0, the integrated Intel GPU 1, and the discrete GPU (AMD or Nvidia) 2. Use the discrete GPU if possible, as it is many times faster than the CPU or integrated graphics. Some models are known to be broken on certain hardware; if a model fails to produce data with “Maths error, bad model”, try forcing SASModels to build on a different device, if one is available.
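If you prefer to enumerate the devices non-interactively, a minimal sketch using the standard pyopencl API looks like this (the platform and device indices it prints are the candidates for PYOPENCL_CTX):

import pyopencl as cl

# Print every OpenCL platform and device visible to pyopencl.
# The indices shown are the values PYOPENCL_CTX accepts.
for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print(f"{p_idx}:{d_idx}  {platform.name} / {device.name}")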

An example call to gen_models for training would be

./gen_models.sh all 15000 1D 100 mono double

By default, the results are saved to a PostgreSQL table named “train_data” in a database named “sas_data”, accessed with the username and password “sasnets”. The schema of train_data is:

Column  Type
------  ---------
id      integer
iq      numeric[]
diq     numeric[]
model   text
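As an illustration, here is a short psycopg2 snippet that pulls a few rows from train_data (a sketch, assuming the default credentials above and a PostgreSQL server on the local machine):

import psycopg2

# Connect with the default SASNets credentials described above;
# adjust host/port if your PostgreSQL server is not local.
conn = psycopg2.connect(dbname="sas_data", user="sasnets", password="sasnets")
with conn, conn.cursor() as cur:
    cur.execute("SELECT iq, diq, model FROM train_data LIMIT 5")
    for iq, diq, model in cur.fetchall():
        print(model, len(iq))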

We recommend at least 10,000 data sets per model when training a network for research use, and 20,000 to 25,000 for production systems. With 15,000 data sets per model, one epoch with batch size 5 over approximately 800,000 examples takes approximately 450 seconds. During training, you can optionally supply evaluation data, which gives more accurate statistics on the network than reusing training data would. Changing line 188 to

"INSERT INTO eval_data (iq, diq, model)....

and rerunning the script will randomly generate data and put it in the eval_data table, which has the same columns as train_data. We recommend 2,500 evaluation data sets per model.
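For reference, the modified statement would execute along these lines (a hypothetical sketch; the exact surrounding code in generate_sets.py may differ):

# Hypothetical sketch of the parametrized INSERT after the edit;
# cur is an open psycopg2 cursor, as in the example above.
cur.execute(
    "INSERT INTO eval_data (iq, diq, model) VALUES (%s, %s, %s)",
    (list(iq), list(diq), model),
)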

Depending on the instrument configuration you are generating data for, parameters within generate_sets.py may need to be edited. Lines 283 and 284 contain the most relevant settings, namely the Q range and the noise level. You can optionally read Q and dQ ranges from files and use those instead, which is done in lines 285 through 287. The data object produced by make_data has instance variables that vary with the options passed to the function; Q is exposed as data.x and dQ as data.dx.
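For instance, a minimal sketch of the kind of Q grid involved, with an assumed 2% resolution (the actual range and noise values are the ones set on lines 283 and 284 of generate_sets.py):

import numpy as np

# Hypothetical Q grid: 100 logarithmically spaced points.
# The real range and noise come from generate_sets.py.
q = np.logspace(np.log10(1e-3), np.log10(1.0), 100)  # Q in 1/Angstrom
dq = 0.02 * q                                         # assumed 2% resolution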