niml.encoder package¶
Submodules¶
niml.encoder.encoder module¶
- class niml.encoder.encoder.Encoder(field_types=None, data_mins=None, data_maxs=None, cyclic_flags=None, spans=None, cat_values=None, cat_overlaps=None, set_bits=None, sparsity=None, missing_val_ind=None, num_fields=None)¶
Encoder object - converts raw data into SDR format according to configuration settings
- Parameters
field_types (list) – A list of characters indicating whether the data in each feature is numeric or categorical. Each entry must be ‘n’ for numeric or ‘c’ for categorical. Defaults to [‘n’]*num_fields
data_mins (list) – User provided float values that represent the minimum possible value that might be presented for each feature
data_maxs (list) – User provided float values that represent the maximum possible value that might be presented for each feature
cyclic_flags – A list of True or False indicators, one per feature, indicating whether the data in that column is cyclical in nature. Defaults to [False]*num_fields
spans –
A list of non-negative integers, one per feature, indicating the encoding type. Some example settings and their effects on how data is encoded:
0 = linear encoding
1 = adaptive with 1 floating bit
2 = adaptive with 2 floating bits
n = adaptive with n floating bits (beware: the encoding grows exponentially larger and will quickly consume CPU)
Defaults to [0]*num_fields
cat_values –
A list of category value lists, one per feature. Each categorical feature needs its own list of category values; in the example below, the non-categorical feature gets an empty list. For a 3-feature dataset whose first and last features contain categorical data, this parameter would look like:
[['red', 'blue', 'green'], [], ['round', 'square', 'ellipse']]
Defaults to [[]]*num_fields
cat_overlaps – A list of overlap percentages, one per categorical feature, given as float values where 0 <= cat_overlap < 1. Defaults to [0]*num_fields
set_bits – A single global integer that applies to all features indicating the number of bits to use for each encoded value
sparsity – A single global float value that applies to all features where 0 < sparsity < 1 indicating how many empty bits to include in each encoding
missing_val_ind – Character or string of characters used within the dataset to indicate that no data is available for a feature
num_fields – The number of features or fields in the dataset. Must be set if field_types is None or an empty list. Defaults to len(field_types).
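The constraints in the parameter list above can be checked before an Encoder is constructed. The following is a minimal sketch and not part of niml: `check_encoder_kwargs` is a hypothetical helper, and the rule that every per-feature list has one entry per field is an assumption inferred from the documented defaults (for example `[False]*num_fields`).

```python
def check_encoder_kwargs(field_types, cat_values=None, cat_overlaps=None,
                         cyclic_flags=None, sparsity=None):
    """Hypothetical helper: validate keyword arguments against the documented rules."""
    num_fields = len(field_types)
    if any(t not in ("n", "c") for t in field_types):
        raise ValueError("field_types entries must be 'n' or 'c'")
    # Assumption: per-feature lists, when given, need one entry per field.
    for name, values in (("cat_values", cat_values),
                         ("cat_overlaps", cat_overlaps),
                         ("cyclic_flags", cyclic_flags)):
        if values is not None and len(values) != num_fields:
            raise ValueError(f"{name} must have {num_fields} entries")
    if cat_overlaps is not None and any(not 0 <= o < 1 for o in cat_overlaps):
        raise ValueError("each cat_overlap must satisfy 0 <= cat_overlap < 1")
    if sparsity is not None and not 0 < sparsity < 1:
        raise ValueError("sparsity must satisfy 0 < sparsity < 1")
    return num_fields


# Three-feature dataset: categorical, numeric, categorical.
kwargs = {
    "field_types": ["c", "n", "c"],
    "cat_values": [["red", "blue", "green"], [], ["round", "square", "ellipse"]],
    "cat_overlaps": [0.0, 0.0, 0.25],
    "sparsity": 0.02,
}
num_fields = check_encoder_kwargs(**kwargs)
```

A dictionary that passes these checks uses only keyword names from the constructor signature, so it could presumably be passed on as `Encoder(**kwargs)`.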
- config_encoder(data_file=None, input_data=None, label_col=0)¶
Processes data from a file name, file handle, numpy array, pandas array, or plain Python list of raw data to build the data-dependent configuration for encoding
- Parameters
data_file (file_like) – Can be either a string representing a file name or an open python file handle
input_data (iterable) – Can be any array like object such as a numpy or pandas array, or a python list of lists
label_col (int, optional) – Allows the user to indicate the label location in the data, defaults to 0 if not provided. Passing ‘None’ will trigger encoding of all columns, used if data has already been split from the labels
- encode(data_file=None, input_data=None, label_col=0)¶
Primary method of the encoder class. Takes either a file name, file handle, numpy array, pandas array, or plain Python list of raw data and encodes it according to the configuration parameters given at the time of class instantiation
- Parameters
data_file – Can be either a string representing a file name or an open python file handle
input_data – Can be any array like object such as a numpy or pandas array, or a python list of lists
label_col – Allows the user to indicate the label location in the data. The signature default is 0. If None is passed, the encoder will encode all columns, assuming the user has already split off the label data (this is more or less required for handling numpy data)
- Returns
labels – If a label column is identified, this will contain a list of each label stripped from the data
isdrs – A list of lists representing the encoded values of the raw data based on the encoder configuration
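The effect of label_col on the returned labels can be illustrated without the encoder itself. The sketch below is an assumption about the behaviour described above, not niml's implementation: `split_labels` is a hypothetical helper that returns raw feature rows, whereas encode returns encoded SDRs.

```python
def split_labels(rows, label_col=0):
    """Hypothetical helper mirroring the documented label_col behaviour."""
    if label_col is None:
        # No label column: every column is treated as feature data.
        return [], [list(row) for row in rows]
    labels = [row[label_col] for row in rows]
    # Keep every column except the label column, preserving order.
    features = [[v for i, v in enumerate(row) if i != label_col] for row in rows]
    return labels, features


rows = [["cat", 1.5, "red"], ["dog", 2.0, "blue"]]
labels, features = split_labels(rows, label_col=0)
all_labels, all_features = split_labels(rows, label_col=None)
```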
- get_encoder_state()¶
Method that builds and returns a dictionary representing all the global and per feature state parameters as well as the data and labels if they are available
- Return type
Anonymous dictionary of the current state of the encoder object
- sdrs_to_bin(out_file, sdr_list, width, set_bits, id_string='loki_isdrs', fill_value=4294967295)¶
Writes sdr_list to out_file in binary format
- Parameters
out_file – Either a file-type object supporting ‘write’ or a string. If a string, a file-type object will be created.
sdr_list – A list of lists. Each internal list is a list of SDR offsets. Typically the internal lists are all the same size, but in the case of ‘missing values’ some lists might be shorter than others
width – The maximum offset value in an sdr_list
set_bits – The maximum number of set bits in each SDR. Typically SDRs contain this many values, but in the case of ‘missing values’ there will be fewer. In that case fill_value is used to pad the SDR so it contains set_bits entries
id_string – The ID string written in the file header. Has no impact on the algorithm or the code
fill_value – The value used to pad ‘missing values’ observations so they contain the same number of set bits as ‘non missing values’ observations
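The padding rule for short SDRs can be sketched as follows. This is an illustration only: `pad_sdrs` is a hypothetical helper, appending the fill values at the end of each list is an assumption, and the actual binary file layout written by sdrs_to_bin is not reproduced here.

```python
FILL_VALUE = 4294967295  # default fill_value from the signature (0xFFFFFFFF)


def pad_sdrs(sdr_list, set_bits, fill_value=FILL_VALUE):
    """Hypothetical helper: pad each SDR to exactly set_bits entries."""
    padded = []
    for sdr in sdr_list:
        if len(sdr) > set_bits:
            raise ValueError("an SDR has more than set_bits entries")
        # Assumption: fill values are appended after the real offsets.
        padded.append(list(sdr) + [fill_value] * (set_bits - len(sdr)))
    return padded


sdr_list = [[3, 17, 42], [5, 9]]  # the second SDR is short (a missing value)
padded = pad_sdrs(sdr_list, set_bits=3)
```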
- static write_16_bits(ba_target, value, idx)¶
Writes a 16-bit version of value to ba_target at position idx, then returns the new idx for ba_target
- static write_32_bits(ba_target, value, idx)¶
Writes a 32-bit version of value to ba_target at position idx, then returns the new idx for ba_target
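The write-then-advance pattern of these two helpers might look like the sketch below. Several details are assumptions not confirmed by this page: that ba_target is a pre-sized bytearray, that values are stored unsigned, and that the byte order is little-endian. The functions reuse the documented names for readability but are stand-ins, not niml's code.

```python
import struct


def write_16_bits(ba_target, value, idx):
    """Sketch: store value as an unsigned 16-bit integer at idx; return the next idx."""
    struct.pack_into("<H", ba_target, idx, value)  # "<H": little-endian (assumed)
    return idx + 2


def write_32_bits(ba_target, value, idx):
    """Sketch: store value as an unsigned 32-bit integer at idx; return the next idx."""
    struct.pack_into("<I", ba_target, idx, value)  # "<I": little-endian (assumed)
    return idx + 4


buf = bytearray(6)
idx = write_16_bits(buf, 0x1234, 0)
idx = write_32_bits(buf, 4294967295, idx)
```

Returning the updated index lets consecutive writes be chained without the caller tracking field widths.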