module util::Sampling
Utilities to randomly select smaller datasets from larger datasets.
Usage
import util::Sampling;
Dependencies
import util::Math;
import Map;
import List;
import Set;
Description
Sampling is important when the analysis algorithms do not scale to the size of the original corpus, or when you need to train an analysis on a representative set without overfitting on the entire corpus. These sampling functions all assume that a uniformly random selection is required.
function sample
Reduce the size of a set by selecting a uniformly distributed sample.
set[&T] sample(set[&T] corpus, int target)
A uniform subset is computed by iterating over the set and keeping each element with a probability of 1/(size(corpus) / target), that is, target/size(corpus). This rapidly generates a new set whose expected size is target, although any particular result will most probably be a little smaller or larger.
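The selection scheme described above can be sketched as follows. This is a minimal illustration and not the library's implementation: sketchSample is a hypothetical name, and the sketch assumes that each element is retained independently with probability target/size(corpus), using toReal and arbReal from util::Math.

```rascal
import util::Math; // toReal, arbReal
import Set;        // size

// Hypothetical helper (not part of util::Sampling): retain each element
// independently with probability target / size(corpus).
set[&T] sketchSample(set[&T] corpus, int target) {
  int n = size(corpus);
  if (n == 0) {
    return corpus; // nothing to sample from
  }
  // Assumed retention probability; values >= 1.0 keep every element,
  // values <= 0.0 keep none, because arbReal() lies in [0.0, 1.0).
  real keep = toReal(target) / toReal(n);
  return { e | e <- corpus, arbReal() < keep };
}
```

Because every element is decided independently, the result size follows a binomial distribution around target, which is why individual runs come out a little smaller or larger.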
Examples
rascal>import util::Sampling;
ok
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","e","h","i","j","k"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"b","e","g","h","j"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"b","f","h","i","k"}
function sample
Reduce the length of a list by selecting a uniformly distributed sample.
list[&T] sample(list[&T] corpus, int target)
The random selection of elements does not change their initial order in the list.
A uniform sublist is computed by iterating over the list and keeping each element with a probability of 1/(size(corpus) / target), that is, target/size(corpus). This rapidly generates a new list whose expected size is target, although any particular result will most probably be a little smaller or larger.
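One way to check the order-preservation property is to sample from an ascending, duplicate-free list and compare the result with its sorted version. The variable names below are illustrative only; sort and toSet come from the List module.

```rascal
import util::Sampling;
import List; // sort, toSet

// An ascending list without duplicates.
list[int] corpus = [1..1000];
list[int] picked = sample(corpus, 30);

// Relative order is preserved, so the sample should still be ascending.
bool stillSorted = picked == sort(picked);

// Every sampled element originates from the corpus.
bool fromCorpus = toSet(picked) <= toSet(corpus);
```

Both checks should hold on every run, even though the contents and the size of picked vary.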
Examples
rascal>import util::Sampling;
ok
rascal>sample([1..1000], 30)
list[int]: [28,60,74,99,105,115,139,191,194,225,238,241,254,255,295,336,369,388,390,433,442,445,456,501,564,702,707,732,775,788,815,817,818,825,880,889,900,923,978,984]
rascal>sample([1..1000], 30)
list[int]: [43,76,82,84,96,107,149,191,219,221,230,264,294,307,315,435,470,487,500,528,626,631,682,757,775,822,845,887,997]
rascal>sample([1..1000], 30)
list[int]: [11,14,25,30,122,149,161,164,249,292,314,369,372,384,402,409,417,471,483,488,512,588,592,648,690,720,734,764,867,870,940,944,982,983,999]
function sample
Reduce the size of a map by selecting a uniformly distributed sample.
map[&T,&U] sample(map[&T,&U] corpus, int target)
A uniform submap is computed by iterating over the map's keys and keeping each key with a probability of 1/(size(corpus) / target), that is, target/size(corpus). This rapidly generates a new map whose expected size is target, although any particular result will most probably be a little smaller or larger.
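A short usage sketch for the map variant, with a made-up corpus of word frequencies (freq and the other variable names are hypothetical). Since the result is a submap, every key that survives keeps the value it had in the corpus.

```rascal
import util::Sampling;
import Map; // size

// Hypothetical corpus: word frequencies keyed by word.
map[str, int] freq = ("a": 3, "b": 1, "c": 7, "d": 2, "e": 5, "f": 4);

// Ask for roughly two entries; the actual size varies per run.
map[str, int] sub = sample(freq, 2);

// Each surviving key keeps its original value, so sub is a submap of freq.
bool isSubmap = sub <= freq;
int kept = size(sub);
```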