
INTRODUCTION

Spacing is the difference between adjacent data points after sorting them.
It is small and stable around the mode of a variate and increases rapidly
in the tails.  Overall, the density of the spacing
resembles a 'U' with a broad base and steep sides.  If we combine variates,
then the spacing will remain flat at each mode and will begin to increase
between them, forming local peaks at the anti-modes.  These features, flats
and peaks, not only mark multiple modes but locate them.  Analyzing
modality, then, requires peak and flat detectors and tests to determine if
they are significant.
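
As a minimal sketch of the idea, assuming numpy and made-up sample data (two
normal variates centered at 0 and 5), the spacing can be computed and
compared near the modes and near the anti-mode.  The helper
mean_spacing_near() is hypothetical and not part of Dimodal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two separated normal variates combined into one bimodal sample.
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])

# Spacing: difference between adjacent points after sorting.
xs = np.sort(x)
spacing = np.diff(xs)

def mean_spacing_near(value, halfwidth=0.5):
    """Average spacing for sorted points lying within halfwidth of value."""
    left = xs[:-1]                      # left end point of each spacing
    mask = np.abs(left - value) < halfwidth
    return spacing[mask].mean()

at_mode_1 = mean_spacing_near(0.0)
at_mode_2 = mean_spacing_near(5.0)
at_antimode = mean_spacing_near(2.5)
print(at_mode_1, at_mode_2, at_antimode)
```

The spacing near 2.5 should come out noticeably larger than at either mode,
which is the local peak the detectors look for.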

The spacing is usually a noisy signal and some smoothing is needed to
identify these features.  We can use a traditional low-pass filter, or we
can look at what we call the interval spacing, where the difference is
taken over a larger gap than adjacent sorted points.  The interval spacing
is equivalent to using a rectangular or running mean low-pass kernel, and
suffers from that filter's limitations: large sidelobes pass
high-frequency components and create a rough signal.  But this roughness
opens the possibility of a different class of tests.
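
The equivalence can be checked numerically.  The sketch below assumes numpy
and an arbitrary interval width w (a hypothetical value, not a Dimodal
default): the difference over w sorted points is the sum of w adjacent
spacings, which is a rectangular kernel up to a factor of w.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.sort(rng.normal(size=200))
spacing = np.diff(xs)

w = 10  # hypothetical interval width, in number of sorted points

# Interval spacing: difference taken over a gap of w sorted points.
interval = xs[w:] - xs[:-w]

# Running sum of w adjacent spacings (rectangular kernel, fully
# overlapping positions only).
running_sum = np.convolve(spacing, np.ones(w), mode="valid")

# Dividing by w gives the running mean; the two signals differ only
# by this constant scale factor.
running_mean = running_sum / w

print(np.allclose(interval, running_sum))
```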

There are three classes of tests, which make progressively fewer
assumptions about the setup.

1. Parametric Models
We have developed parametric models of the height of a peak or length of a
flat that depend on an assumed null distribution of the variates, the
amount of data, and the low-pass filter.  For peaks there is a clear,
conservative choice of distribution (an asymmetric Weibull) but the model
is complicated and degrades for very small and large data sets.  The flats
model is simpler and fits better, but varies with the null choice.  We
provide three: one that is quicker to accept flats (the Weibull), one that
is more conservative (a Gumbel), and a compromise (a logistic) that is
used by default.

2. Runs
The roughness of the interval spacing allows us to look at the runs that
form peaks, sequences of increasing, decreasing, or tied values.  Wald and
Wolfowitz derived the statistics of the expected number of runs in two
symbols, and Kaplansky and Riordan extended the result to include the three
we have.  A second test determines the probability of the longest run by
modeling the interval spacing as a Markov chain.
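
To make the three symbols concrete, the following sketch (assuming numpy;
run_lengths() is a hypothetical helper, not a Dimodal function) labels each
step of a signal as increasing, decreasing, or tied and collects the runs.
It only counts runs and does not compute the run statistics or the Markov
chain probability described above.

```python
import numpy as np

def run_lengths(values):
    """Return (symbols, lengths) for runs of increasing (+1),
    decreasing (-1), or tied (0) steps between consecutive values."""
    steps = np.sign(np.diff(np.asarray(values, dtype=float))).astype(int)
    if steps.size == 0:
        return [], []
    # A new run starts wherever the symbol changes.
    breaks = np.flatnonzero(np.diff(steps) != 0) + 1
    starts = np.concatenate(([0], breaks))
    ends = np.concatenate((breaks, [steps.size]))
    symbols = steps[starts].tolist()
    lengths = (ends - starts).tolist()
    return symbols, lengths

signal = [1, 2, 3, 3, 2, 1, 0, 4, 4, 4, 5]
symbols, lengths = run_lengths(signal)
print(len(lengths), max(lengths))   # number of runs, longest run
```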

3. Data-Based
A third test uses permutations of the runs within the feature to determine
the distribution of heights, which sets the likelihood of the peak.  It is
an example of a sample test, using only the data to estimate the feature's
significance.  Similarly, we can use the difference of the filtered spacing,
low-pass or interval, as a pool to estimate the distribution for the
feature, peak or flat, without any assumptions.  We call these excursion
tests.
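
The general idea of drawing from a pool of differences can be illustrated
with a generic resampling sketch.  This is an assumption-laden
illustration, not the procedure Dimodal uses: the function
excursion_pvalue(), its arguments, and the choices of width, height, and
number of draws are all hypothetical.

```python
import numpy as np

def excursion_pvalue(filtered, height, nsteps, ndraw=2000, seed=0):
    """Fraction of resampled nsteps-long walks whose net rise reaches
    height.  Illustrative only; the procedure in Dimodal may differ."""
    rng = np.random.default_rng(seed)
    # Pool: differences of the filtered spacing.
    pool = np.diff(np.asarray(filtered, dtype=float))
    # Draw steps from the pool with replacement and accumulate them.
    draws = rng.choice(pool, size=(ndraw, nsteps), replace=True)
    rises = draws.sum(axis=1)
    return float(np.mean(rises >= height))

rng = np.random.default_rng(2)
xs = np.sort(rng.normal(size=300))
w = 15                                  # hypothetical interval width
filtered = xs[w:] - xs[:-w]             # interval spacing

# Hypothetical feature: a rise of this height over 20 steps.
height = filtered.max() - np.median(filtered)
p = excursion_pvalue(filtered, height, nsteps=20)
print(p)
```

A small value suggests that a rise of that height is unlikely to arise from
the step sizes seen elsewhere in the signal.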

DimodalPy is a Python wrapper to a C version of Dimodal, itself an R package
that implements the feature detectors and tests.  The source code for all
versions is available at https://www.primachvis.com/data_spacing.html.


USAGE

DimodalPy exposes one class, DiOpt, that sets parameters controlling the
analysis, filters, detectors, and tests, and one function, check_modality(),
that performs the analysis.  The usual flow follows:

>>> import dimodalPy.dimodalPy as dm
>>> o = dm.DiOpt()
>>> o.defaults()
>>> o.set(param1=value1, param2=value2)
>>> m = dm.check_modality(x, o)
>>> m.print()
>>> m.plot()

The data x can be a list, a 1D array, or a 1D numpy array.  The DiOpt class
has methods read() to pull parameters from a file, write() to save them,
and set() to change their values.  m is an instance of the ModalAnalysis
class with methods print() to show the results and plot() that uses
matplotlib to graph them.  It has members
  m.data     instance of DiData with raw data and filtered spacings
  m.lppeak   instance of DiPeak with local extrema in the low-pass spacing
             and peak test results
  m.lpflat   instance of DiFlat with flats in the low-pass spacing and tests
  m.diwpeak  instance of DiPeak with peaks in the interval spacing and tests
  m.diwflat  instance of DiFlat with interval spacing flats and test results
The three classes DiData, DiPeak, and DiFlat are themselves containers of
Datarow, Datastat, Extremum, Peak, and Flat instances.  They have a method
to uniformly index the individual features and data rows, while the five
lower-level classes contain the information about each feature and its
test results.
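
A short end-to-end sketch, assuming numpy and made-up bimodal data, shows
the flow with default options.  It uses only the calls named above; the
import is guarded so the script still runs where the package is not
installed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical test data: two normal variates combined into one sample.
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])

try:
    import dimodalPy.dimodalPy as dm
except ImportError:
    dm = None   # package not installed; skip the analysis

if dm is not None:
    o = dm.DiOpt()
    o.defaults()
    m = dm.check_modality(x, o)   # a plain list, x.tolist(), also works
    m.print()
```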


BUILDING

DimodalPy is part of the source code distribution for DimodalC.  Building it
requires:
  SWIG
  awk
  Python headers          (specifically, Python.h)
  numpy headers           (specifically, numpy/arrayobject.h)
  numpy source            (specifically, the numpy/tools/swig/numpy.i file)
  pip and wheel packages  (to create the binary distribution)

Edit the Makefile, setting the paths
  PYINCPATH     to the Python and numpy headers
  SWIGINCPATH   to the directory with the numpy.i file
then run
> make python

This creates a wheel file in build/dimodalPy-*.whl which can be installed
or published.  Alternatively you can work directly within the dimodalPy
directory, in which case the namespace will have no package component, i.e.
>>> import dimodalPy as dm

To remove all files (Python and C), run
> make clean

Running make without a target, or with 'all', also builds the C executables
DimodalC and test_Dimodal.

DimodalPy has been developed under Linux.  In principle it should compile
under other operating systems, but this has not been tried.


DEVELOPMENT

The DEVEL file has details about the implementation of dimodalPy.


CONTACT

Please contact us at
  support@primachvis.com
with bug reports, feature requests, or other comments.  We look forward to
hearing from you.
