niml.encoder package

Submodules

niml.encoder.encoder module

class niml.encoder.encoder.Encoder(field_types=None, data_mins=None, data_maxs=None, cyclic_flags=None, spans=None, cat_values=None, cat_overlaps=None, set_bits=None, sparsity=None, missing_val_ind=None, num_fields=None)

Encoder object. Converts raw data into sparse distributed representation (SDR) format according to its configuration settings

Parameters
  • field_types (list) – A list of characters, one per feature, indicating whether the data in that feature is numeric (‘n’) or categorical (‘c’). Defaults to [‘n’]*num_fields

  • data_mins (list) – User provided float values that represent the minimum possible value that might be presented for each feature

  • data_maxs (list) – User provided float values that represent the maximum possible value that might be presented for each feature

  • cyclic_flags – A list of True or False indicators, one per feature, indicating whether the data in that column is cyclical in nature. Defaults to [False]*num_fields

  • spans

    A list of non-negative integers, one per feature, indicating the encoding type. Some example settings and their effects on how data is encoded:

    0 = linear encoding
    1 = adaptive with 1 floating bit
    2 = adaptive with 2 floating bits
    n = adaptive with n floating bits (beware: the encoding grows
        exponentially larger and will quickly consume CPU)
    

    Defaults to [0]*num_fields

  • cat_values

    A list of category value lists, one per feature. Each categorical feature needs its own list of category values. For example, for a 3-feature dataset whose first and last features contain categorical data, this parameter would look something like:

    [['red', 'blue', 'green'], [], ['round', 'square', 'ellipse']]
    

    Defaults to [[]]*num_fields

  • cat_overlaps – A list of overlap percentages for each categorical feature, given as float values where 0 <= cat_overlap < 1. Defaults to [0]*num_fields

  • set_bits – A single global integer that applies to all features indicating the number of bits to use for each encoded value

  • sparsity – A single global float value that applies to all features where 0 < sparsity < 1 indicating how many empty bits to include in each encoding

  • missing_val_ind – Character or string of characters used within the dataset to indicate that no data is available for a feature

  • num_fields – The number of features or fields in the dataset. Must be set if field_types is None or an empty list. Defaults to len(field_types).
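
The per-feature parameters above all follow the same pattern: one list entry per field, in column order. The sketch below builds such lists for the 3-feature example and checks them against the constraints stated in this section. `check_encoder_config` is a hypothetical helper written for illustration, not part of niml, and the library may validate its arguments differently or not at all. The sketch also shows a general Python detail worth keeping in mind when writing a default like `[[]]*num_fields` by hand: list multiplication repeats one inner list object rather than creating independent lists.

```python
def check_encoder_config(num_fields, field_types, spans, cat_values, cat_overlaps):
    """Hypothetical pre-flight check (not part of niml).

    Verifies the constraints stated in the parameter descriptions.
    """
    per_feature = {
        "field_types": field_types,
        "spans": spans,
        "cat_values": cat_values,
        "cat_overlaps": cat_overlaps,
    }
    # Every per-feature list needs exactly one entry per field.
    for name, values in per_feature.items():
        if len(values) != num_fields:
            raise ValueError(f"{name} needs {num_fields} entries, got {len(values)}")
    if any(t not in ("n", "c") for t in field_types):
        raise ValueError("field_types entries must be 'n' or 'c'")
    if any(not isinstance(s, int) or s < 0 for s in spans):
        raise ValueError("spans entries must be non-negative integers")
    if any(not 0 <= o < 1 for o in cat_overlaps):
        raise ValueError("cat_overlaps entries must satisfy 0 <= value < 1")
    return True


num_fields = 3
field_types = ["c", "n", "c"]      # first and last features are categorical
spans = [0] * num_fields           # linear encoding for every feature
cat_overlaps = [0] * num_fields
cat_values = [["red", "blue", "green"], [], ["round", "square", "ellipse"]]
config_ok = check_encoder_config(num_fields, field_types, spans, cat_values, cat_overlaps)

# List multiplication repeats ONE inner list; a comprehension creates
# independent lists that can be filled per feature.
shared = [[]] * num_fields
shared[0].append("red")            # visible in every slot
independent = [[] for _ in range(num_fields)]
independent[0].append("red")       # visible in slot 0 only
```

Building `cat_values` explicitly, as in the example above, avoids the shared-list behavior entirely.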

config_encoder(data_file=None, input_data=None, label_col=0)

Processes data from a file name, file handle, numpy array, pandas array, or plain old Python list of raw data to build the data-dependent configuration for encoding

Parameters
  • data_file (file_like) – Can be either a string representing a file name or an open python file handle

  • input_data (iterable) – Can be any array like object such as a numpy or pandas array, or a python list of lists

  • label_col (int, optional) – Allows the user to indicate the label location in the data. Defaults to 0 if not provided. Passing None triggers encoding of all columns; use this if the labels have already been split from the data

encode(data_file=None, input_data=None, label_col=0)

Primary method of the encoder class. Takes either a file name, file handle, numpy array, pandas array, or plain old Python list of raw data and encodes it according to the configuration parameters given at the time of class instantiation

Parameters
  • data_file – Can be either a string representing a file name or an open python file handle

  • input_data – Can be any array like object such as a numpy or pandas array, or a python list of lists

  • label_col – Allows the user to indicate the label location in the data. If None is passed, the encoder will encode all columns, assuming the user has already split off the label data (this is more or less required for handling numpy data)

Returns

  • labels – If a label column is identified, this will contain a list of each label stripped from the data

  • isdrs – A list of lists representing the encoded values of the raw data based on the encoder configuration
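
To make the effect of label_col concrete, the sketch below separates a label column from raw rows in plain Python. `split_labels` is a hypothetical helper, not niml code, and it only mirrors the behavior described above: an integer index strips that column into a list of labels, while None leaves every column in place for encoding. It assumes a non-negative index and rows given as a list of lists.

```python
def split_labels(rows, label_col=0):
    """Hypothetical illustration of label_col semantics (not niml code)."""
    if label_col is None:
        # No label column: every column is treated as feature data.
        return [], [list(row) for row in rows]
    labels = [row[label_col] for row in rows]
    features = [
        [value for i, value in enumerate(row) if i != label_col]
        for row in rows
    ]
    return labels, features


rows = [["cat", 1.5, "red"], ["dog", 2.0, "blue"]]

# The label sits in column 0, so it is stripped from each row.
labels, features = split_labels(rows, label_col=0)

# The labels were split off beforehand, so all columns are kept.
all_labels, all_features = split_labels(rows, label_col=None)
```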

get_encoder_state()

Method that builds and returns a dictionary representing all the global and per feature state parameters as well as the data and labels if they are available

Return type

Anonymous dictionary of the current state of the encoder object

sdrs_to_bin(out_file, sdr_list, width, set_bits, id_string='loki_isdrs', fill_value=4294967295)

Writes SDR_LIST to OUT_FILE in binary format

Parameters
  • out_file – Can be either a file-type object supporting ‘write’ or a string. If a string, a file-type object will be created

  • sdr_list – A list of lists. Each internal list is a list of SDR offsets. Typically the internal lists are all the same size, but in the case of ‘missing values’ some lists might be shorter than others

  • width – The maximum offset value in an sdr_list

  • set_bits – The maximum number of set bits in each SDR. Typically SDRs contain this many values, but in the case of ‘missing values’ there will be fewer values. In this case fill_value will be used to pad the SDR so it contains set_bits entries

  • id_string – The id string written in the file header. Has no impact on the algorithm or the code

  • fill_value – The value used to supplement ‘missing values’ observations so they contain the same number of set bits as ‘non missing values’ observations
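
The padding described for set_bits and fill_value can be sketched as follows. `pad_sdr` is a hypothetical helper, not the library's implementation. It assumes the fill entries are appended after the real offsets, which the description above does not specify. The default shown, 4294967295, is the fill_value default from the signature and equals the largest unsigned 32-bit integer.

```python
FILL_VALUE = 4294967295  # default fill_value from the signature (2**32 - 1)


def pad_sdr(sdr, set_bits, fill_value=FILL_VALUE):
    """Hypothetical sketch: pad a short SDR to exactly set_bits entries.

    Assumption: fill entries go at the end of the list.
    """
    if len(sdr) > set_bits:
        raise ValueError("SDR has more entries than set_bits")
    return list(sdr) + [fill_value] * (set_bits - len(sdr))


# The second SDR is short, as happens when a feature has missing values.
sdr_list = [[3, 17, 42, 101], [5, 42]]
padded = [pad_sdr(sdr, set_bits=4) for sdr in sdr_list]
```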

static write_16_bits(ba_target, value, idx)

Writes a 16-bit version of VALUE to ba_target, then returns the new idx for ba_target

static write_32_bits(ba_target, value, idx)

Writes a 32-bit version of VALUE to ba_target, then returns the new idx for ba_target
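
As an illustration of the write-then-advance pattern these two helpers describe, the sketch below writes fixed-width unsigned integers into a bytearray with the standard library's `struct.pack_into` and returns the next free index. The function names are hypothetical stand-ins, and the little-endian byte order and unsigned interpretation are assumptions; the documentation above does not state which the library uses.

```python
import struct


def write_16_bits_sketch(ba_target, value, idx):
    """Write value as an unsigned 16-bit integer at idx; return the next index.

    Assumption: little-endian byte order.
    """
    struct.pack_into("<H", ba_target, idx, value)
    return idx + 2


def write_32_bits_sketch(ba_target, value, idx):
    """Write value as an unsigned 32-bit integer at idx; return the next index.

    Assumption: little-endian byte order.
    """
    struct.pack_into("<I", ba_target, idx, value)
    return idx + 4


# Chain the calls by feeding each returned index into the next write.
buf = bytearray(6)
idx = write_16_bits_sketch(buf, 0x1234, 0)
idx = write_32_bits_sketch(buf, 4294967295, idx)
```

Returning the new index lets a caller chain writes without tracking field widths separately.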