An Efficient, Optimal Binning of Data
We have a project where we need to reduce a large set of measurement values into a small number of discrete bins. How can we best define the bins and a typical value to represent them, and then efficiently find the breaks?
Sample data to bin into discrete buckets. Where to put the thresholds?
Our dataset will contain $N$ values $x_1 \le x_2 \le \ldots \le x_N$; let it be sorted so that we can split it into $B$ contiguous bins. In other words, a bin $b$ will be defined by an upper value $t_b$ and its members will be the data $x$, where $t_{b-1} < x \le t_b$. In the sorted data this corresponds to a range defined by its upper end point index $e_b$ and the members will have indices $i$ where $e_{b-1} < i \le e_b$ (indices here are 1-based, with $e_0 = 0$ and $e_B = N$). We will represent the bin by some value $\chi_b$ and can use the sum of the squared differences between the points in the bin and this value as the metric to optimize when choosing end points. This means we want to minimize the Squared Error

$$SE_b = \sum_{i=e_{b-1}+1}^{e_b} (x_i - \chi_b)^2$$
The mean over the bin is the best estimate for $\chi_b$, as you can see by differentiating SE by $\chi_b$, setting the equation to 0, and solving

$$\frac{\partial SE_b}{\partial \chi_b} = -2 \sum_{i=e_{b-1}+1}^{e_b} (x_i - \chi_b) = 0 \quad\Rightarrow\quad \chi_b = \frac{1}{n_b} \sum_{i=e_{b-1}+1}^{e_b} x_i$$

using $n_b = e_b - e_{b-1}$ to count the data points in the bin. We want to minimize SE over all bins, or, summing,

$$SE = \sum_{b=1}^{B} \sum_{i=e_{b-1}+1}^{e_b} (x_i - \chi_b)^2$$
Instead we could use the Mean Squared Error

$$MSE_b = \frac{1}{n_b} \sum_{i=e_{b-1}+1}^{e_b} (x_i - \chi_b)^2$$
This is also the variance within the bin.
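To pin down the definitions, here is a small R sketch (the function name part.metric is mine, not the package's) that scores a candidate partition of sorted data, given the upper end point index of each bin:

# Total (M)SE of a partition of sorted data x, given the upper end
# point index of each bin (the last entry must be length(x)).
part.metric <- function(x, ends, mse=FALSE) {
  starts <- c(1, head(ends, -1) + 1)
  sum(mapply(function(s, e) {
    chi <- mean(x[s:e])               # best representative value for the bin
    se <- sum((x[s:e] - chi)^2)       # squared error within the bin
    if (mse) se / (e - s + 1) else se
  }, starts, ends))
}

x <- sort(c(rnorm(50, 0), rnorm(50, 5)))
part.metric(x, c(50, 100))            # SE of a two-bin split at index 50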
We can test all possible partitions to find the one with the smallest (M)SE, but this is expensive. There are $P(N, B-1)$ permutations of the $N$ indices taken $B-1$ at a time, but we also require the end points be increasing, so only one of the $(B-1)!$ possible ways to order the selection is valid. In other words, the total count of possibilities is the combination $\binom{N}{B-1}$. The exhaustive search is $O(N^{B-1})$ and is unusable for all but small $N$.
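For scale, with $N = 1000$ and five bins there are $\binom{999}{4} \approx 4 \times 10^{10}$ candidate partitions. A brute-force search, again my own sketch reusing part.metric from above, makes the cost concrete:

# Exhaustive search: try every combination of B-1 interior end points.
# choose(N-1, B-1) candidates, each costing O(N) to score.
brute.force <- function(x, nbin) {
  N <- length(x)
  cands <- combn(N - 1, nbin - 1)     # increasing sets of interior end points
  best <- NULL
  best.se <- Inf
  for (k in seq_len(ncol(cands))) {
    ends <- c(cands[, k], N)          # last bin always ends at N
    m <- part.metric(x, ends)
    if (m < best.se) { best.se <- m; best <- ends }
  }
  best
}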
We can do better with a recursive process. Consider the partial metric for a range of indices $i$ through $j$, inclusive,

$$SE(i, j) = \sum_{k=i}^{j} (x_k - \bar{x}_{i:j})^2$$

where $\bar{x}_{i:j}$ is the mean over the range. We first calculate this anchored to the right side by the last point, for all possible starting points

$$SE_1(i) = SE(i, N)$$

The best two-bin partition starting at $i$ is to an intermediate index $j$ that minimizes the metric

$$SE_2(i) = \min_{j} \left[ SE(i, j) + SE_1(j + 1) \right]$$

where $i + 1 \le j$ and $j \le N - 2$, so that each bin holds at least two points.
The three-bin partition finds the best combination of a partial range on the left and the anchored two-bin range on the right

$$SE_3(i) = \min_{j} \left[ SE(i, j) + SE_2(j + 1) \right]$$

This continues until we have all the bins we need. Note that we only need calculate $SE_B(1)$ for the final bin. The endpoints $e_b$ give the break points for the binning, and the corresponding values $x_{e_b}$ the thresholds.
Calculation of the next level of bins. Light brown boxes are fixed by the previous recursion level. Partitions with fewer than two points per bin are omitted.
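The whole recursion fits in a few lines of R. This is a sketch of my own, not the package's implementation: it scores SE(i, j) directly for clarity (see the cumulative-sum speedup below) and relaxes the two-point minimum to one point per bin.

# Dynamic-programming search for the best nbin-bin partition of x.
# best[i] holds the best metric for covering x[i..N] with the bins
# placed so far; ends[i, ] the corresponding end points.
optbin.dp <- function(x, nbin) {
  x <- sort(x)
  N <- length(x)
  se <- function(i, j) { v <- x[i:j]; sum((v - mean(v))^2) }
  # Level 1: a single bin anchored on the right, for every start point.
  best <- sapply(1:N, function(i) se(i, N))
  ends <- matrix(N, nrow=N, ncol=1)
  for (lvl in seq(2, length.out=nbin - 1)) {
    new.best <- rep(Inf, N)
    new.ends <- matrix(NA, N, lvl)
    starts <- if (lvl == nbin) 1 else 1:(N - lvl + 1)  # only i=1 at the top level
    for (i in starts) {
      for (j in i:(N - lvl + 1)) {    # split: bin [i..j] + lvl-1 bins in [j+1..N]
        m <- se(i, j) + best[j + 1]
        if (m < new.best[i]) {
          new.best[i] <- m
          new.ends[i, ] <- c(j, ends[j + 1, ])
        }
      }
    }
    best <- new.best
    ends <- new.ends
  }
  list(breaks=ends[1, ], se=best[1])
}

optbin.dp(c(rnorm(30, 0), rnorm(30, 5)), 2)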
The advantage we gain is that we discard all but the best partition before going to the next higher level of recursion. The complexity is $O(B N^2)$, where we take advantage of calculating only $SE_B(1)$ for the final bin and there are roughly $N$ starting and $N$ intermediate points, reduced a few because we require two points per bin.
We can compute the average efficiently for all ranges by making a cumulative sum over x and taking the difference between the end point and the point just prior to the start of the range, then dividing by the number of points. This sum need only be calculated once for the sorted data, so the average reduces to a subtraction and a division. We can also make a cumulative sum of the $x_i^2$ to calculate the (M)SE quickly, using the identity $SE(i, j) = \sum_{k=i}^{j} x_k^2 - n\,\bar{x}_{i:j}^2$ with $n = j - i + 1$, although this one-pass approach may introduce inaccuracies from floating point precision if the data has very small and very large values, or for very large N.
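A sketch of that bookkeeping, with hypothetical names rather than the package's internals: after padding the prefix sums with a leading zero, the mean and SE of any inclusive range i..j collapse to a few arithmetic operations.

x <- sort(x)
csum  <- c(0, cumsum(x))     # padded with 0 so ranges starting at 1 work
csum2 <- c(0, cumsum(x^2))

# O(1) mean and SE over the inclusive index range i..j.
range.mean <- function(i, j) (csum[j + 1] - csum[i]) / (j - i + 1)
range.se <- function(i, j) {
  s <- csum[j + 1] - csum[i]
  # sum((x - mean)^2) == sum(x^2) - (sum x)^2 / n
  (csum2[j + 1] - csum2[i]) - s * s / (j - i + 1)
}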
With enough memory we can cache the $SE(i, j)$ values; because $i \le j$ the matrix would be upper triangular and require $N(N+1)/2$ elements. A 2 GB cache is large enough for 64 K points and speeds runtimes by a factor of 3-5.
Our approach is one of a class of ways of partitioning data. The PELT algorithm used to find changepoints (Killick et al., "Optimal Detection of Changepoints with a Linear Computational Cost", on arXiv) adds a screening test to rule out intermediate points, which in the best case turns the recursion into an $O(N)$ cost. On our test data we never saw this gain. Putting the screen in the outer loop rejected too few intermediate points to compensate for the test's extra time. Screening in the inner loop was more successful, but here the calculations are so simple that any additional test will increase runtime.
The R package optbin, available on CRAN or as a source tarball, implements this algorithm. The function
bins <- optbin(x, numbin, metric=c('se', 'mse'))
finds the best partition. It has other arguments to limit the cache size (switching to the slower, direct calculation if necessary) and to remove NA elements; see the function's help page. It returns an object of class 'optbin' with the upper bin thresholds, the breakpoint indices in the sorted array, the average value within each bin, and the (M)SE metric. You can apply a partition to new data (or the unsorted input) with
assign.optbin(x, optbin, extend.upper=FALSE)
which returns a list of bin assignments for each data point. Points outside the original range are set to NA unless the extend.upper argument is TRUE, in which case they are put in the last bin. Remember that the first bin has no bottom to extend. You can plot the binning with
plot(optbin)
which shows the sorted data color-coded by bin assignment, or
hist(optbin, bincol=NULL)
which shows a histogram of the data, with the binning indicated by vertical lines, bars under the histogram colored per the bincol vector (or an internal color set), and the bin averages superimposed.
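Putting the pieces together, a minimal session might look like this; the data is simulated for illustration and is not the measurement set from the figures.

library(optbin)

# Simulated measurements: two overlapping clusters.
x <- c(rnorm(500, mean=2, sd=0.5), rnorm(500, mean=5, sd=1.0))

bins <- optbin(x, 3, metric='se')
print(bins)                  # thresholds, break indices, averages, and SE
plot(bins)                   # sorted data color-coded by bin
hist(bins)                   # histogram with thresholds and averages

# Bin new measurements; 9.5 falls above the top threshold and would be
# NA without extend.upper.
assign.optbin(c(1.5, 4.0, 9.5), bins, extend.upper=TRUE)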
When dividing the example data into three bins, the thresholds fall at the steep transitions in the sorted data (in the left graph) and near the minima of the histogram (grey lines in the right graph; the colored lines are drawn at each bin's value). Adding another bucket divides the middle bin, near a small minimum in the histogram. The number of bins is fixed for our application, but you can see that if it were not then we would have to determine the best division manually. Many clustering algorithms, notably k-means, suffer from the same problem, and there is no good way to automatically determine the number of bins.
Data binned into 3 buckets, seen in sorted values (left) and histogram (right).
Data binned into 4 buckets, seen in sorted values (left) and histogram (right).
Optimizing on the MSE may choose more natural breakpoints. With SE, seven bins continue to sub-divide the buckets in the middle; in the left graph you can see additional breaks in the left and right buckets of the 3-bin partition, plus two in the middle. With MSE the thresholds lie at local minima of the histogram (right graph), except for the fifth. In particular, the long, low tail at high values has been split off.
Data binned into 7 buckets by SE (left) and MSE (right).