Documentation for evaluation scripts¶

evalFrame (EvalAFrame.py) script documentation¶

Parameters¶

frame_filename: file where trained frame is stored
gold_filename: xml file of examples correctly labeled for a positive or negative match of the frame, as well as annotated for attributes
verbose_frame: boolean, if True script prints debug information about Frame for each sentence. Default False
verbose_attribte: boolean, if True script prints debug information about attribute matches for each sentence. Default False
limit: int, sets cap on number of messages to print in verbose mode. Default 5.
skip: specify names of attributes to ignore in evaluation. Default None
fuzzy: bool, if True, attributes extracted which are substrings of the gold-data attribute are ignored when calculating accuracy (normally they would be considered incorrect answers). Default False.
generous: bool, only used if fuzzy is also True. If True, fuzzy answers are instead counted as correct answers. Default False.

Evaluation data format¶

Note that positive examples can also have entities labeled (in this case <Exercise></Exercise>)

<?xml version="1.0" ?><wrapper>
<positive index="43900">Had a great <Exercise>workout</Exercise>at the gym this afternoon.</positive>
<negative index="3957">Yesterday, I watched my grandfather play with my son and their bond was the most touching thing I've seen in a long time.</negative>
<positive index="1089">After my <Exercise>workout</Exercise>I exhaled with confidence in knowing I gave it all I had.</positive>
<negative index="2049">I chatted with Julia today.</negative>
</wrapper>

prepare_train_and_test_data.py documentation¶

Calls (data_to_xml or data_to_xml_with_attributes) and drop_gold_from_train to prepare gold test data and training sets that do not include it input file should have as a header:

_unit_id |frame_name |frame_confidence |locations|attribute_confidence |original_index  | text|

Takes following command line arguments: * annotated_file: positional, string filename of crowd-labeled to provide a subset that will be converted to gold set xml

–attributes, -a: boolean, default True, if True adds attribute labeling to xml
–confidence, -c:boolean, default True, if True only uses confidence values above a threhhold (default .6)
–confidence_value, -cv: float, default 0.6, set a custome threshold
–prefix, -p: Set a string file prefix for all files to be written by this script, default “../resources/prepared_data/”
–use_index, -i: boolean, default True, if false will not attempt to recompile indices
–training_file, -t: string, filename of file to be used for training that needs duplicates with test set dropped. Default “../resources/hm_indexed.csv”
–size, -s: int, set size of gold set, default 100
–tfile_index, -ti: string, default “0”, name of column containing indices in training file

drop_gold_from_train documentation¶

In order to avoid overfitting to evaluation data, we’ve provided a script called dropGold in the file frameit/drop_gold_from_train.py. Given a file with gold-data in the above XML format and a corpus for training, it returns a gold set file and a corpus file with the gold examples removed. Optionally, you can choose to have it only use a (random) subset of the gold set. Note that for this to work, the “index” values in the gold set must correspond to the correct index values in the corpus.

Parameters¶

train_file: the filename of the corpus to be used for training
positive_examples: the filename of an xml file containing gold positive examples
negative_examples: the filename of an xml file containing gold negative examples. Note that if you have positive and negative gold examples in the same file, you can pass the same file to both parameters (the script checks for whether the examples are labeled as positive or negative)
sample_size: default -1, in which case the script will use the entire gold set. If you would prefer to use a random subset of the gold set, provide an integer of the desired size.
is_random: default None, which causes the script to use a random seed for sampling. If passed an integer, it will use that integer as a seed, allowing for deterministic output.
index_column_name: default 0, set the column number for the column in the data set that contains the index.
prefix: default empty string, set the file hierarchy prefix for the desired output files (e.g. “../resources”)
file_prefix: default empty string, set any desired prefix to be added to the output filenames. This is useful if you want to label your data set with a date or other identifiable name.

data_to_xml_with_attributes.py and data_to_xml.py documentation¶

These classes converts a CSV containing labeled data into an XML file in the gold set data format. data_to_xml_with_attributes takes a csv with attribute and intent labeling as input, data_to_xml takes a csv with intent labeling but without attribute labeling as input.

Parameters data_to_xml_with_attributes.py¶

filename: the string name of the CSV file to be converted into a gold XML set
id_col: column number (int) or name (str) corresponding to a column with a unique identifier in the csv
results_col: column number (int) or name (str) corresponding to the column with frame intent labels
confidence_col: column number (int) or name (str) corresponding to a confidence value for the frame intent result
original_index_col: column number (int) or name (str) for the column containing indices
text_col: column number (int) or name (str) corresponding to a column with the actual sentence data points in string form
label_confidence_col: column number (int) or name (str) for the column containing confidence values for entity labels
confidence_threshold: float, sets a confidence threshold. Examples with labels that have lower confidence than the threshold will be ignored.
attribute_index: column number (int) or name (str) for the column containing entity labels
use_confidence: bool, if True uses the confidence_threshold. If False, all examples will be added to the gold set regardless of confidence
target: string, file prefix for output file
indices: bool. Set to True if using indices in the input file.

Parameters data_to_xml.py¶

filename: the string name of the CSV file to be converted into a gold XML set
id_col: column number (int) or name (str) corresponding to a column with a unique identifier in the csv
results_col: column number (int) or name (str) corresponding to the column with frame intent labels
confidence_col: column number (int) or name (str) corresponding to a confidence value for the frame intent result
original_index_col: column number (int) or name (str) for the column containing indices
text_col: column number (int) or name (str) corresponding to a column with the actual sentence data points in string form
confidence_threshold: float, sets a confidence threshold. Examples with labels that have lower confidence than the threshold will be ignored.
use_confidence: bool, if True uses the confidence_threshold. If False, all examples will be added to the gold set regardless of confidence
target: string, file prefix for output file

Documentation for evaluation scripts¶

evalFrame (EvalAFrame.py) script documentation¶

Parameters¶

Evaluation data format¶

prepare_train_and_test_data.py documentation¶

drop_gold_from_train documentation¶

Parameters¶

data_to_xml_with_attributes.py and data_to_xml.py documentation¶

Parameters data_to_xml_with_attributes.py¶

Parameters data_to_xml.py¶

FrameIt

Navigation

Related Topics