helper package¶

Submodules¶

helper.convert_to_parquet module¶

Converts flat files into Snappy-compressed Parquet without using Spark or the JVM. Compressing the files before they are put into HDFS, or before Spark reads them, speeds up both of those steps.

class helper.convert_to_parquet.Conversion(input_name, output_name, delimiter)[source]¶

Convert file to parquet.

check()[source]¶

Check whether the output file has already been written. During recovery or a re-run, this avoids wasting time re-writing the same files.

Returns:	bool – True if the output path exists

execute()[source]¶

Check the file’s existence and convert into parquet if the output does not already exist. Public method to call on the class.
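The skip-if-already-written behaviour of check() and execute() can be sketched as follows. This is a minimal illustration, not the package's implementation: the class name `SkipIfExists` and the injected `convert` callable are hypothetical, introduced so the sketch does not depend on the real read() and write() methods.

```python
import os


class SkipIfExists:
    """Hypothetical stand-in for the check()/execute() pair."""

    def __init__(self, input_name, output_name, convert):
        self.input_name = input_name
        self.output_name = output_name
        # `convert` is any callable taking (input_name, output_name).
        # It is an assumption of this sketch, not part of the documented class.
        self.convert = convert

    def check(self):
        """Return True if the output path already exists."""
        return os.path.exists(self.output_name)

    def execute(self):
        """Convert only when the output is missing; return True if work was done."""
        if self.check():
            return False
        self.convert(self.input_name, self.output_name)
        return True
```

Calling execute() a second time on the same pair of paths does nothing, which is what makes a re-run after a partial failure cheap.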

read()[source]¶

Read the file into a pandas DataFrame, as required by fastparquet. The delimiter is chosen explicitly rather than detected, because an explicit delimiter lets pandas read with its C engine instead of the slower Python engine.

Returns:	pandas.DataFrame
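A minimal sketch of reading with an explicit delimiter is shown below. In pandas, automatic delimiter detection (`sep=None`) is only available with the Python parsing engine, so passing the separator keeps the faster C engine in use. The function name `read_flat_file` and the mode names `"csv"` and `"tsv"` are assumptions made for illustration; the module's actual mode values are not documented here.

```python
import pandas as pd

# Assumed mapping from a mode name to a delimiter; the real names may differ.
DELIMITERS = {"csv": ",", "tsv": "\t"}


def read_flat_file(path, mode):
    """Read a delimited file with an explicit separator and the C parser."""
    return pd.read_csv(path, sep=DELIMITERS[mode], engine="c")
```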

write(dataframe)[source]¶

Write a pandas dataframe into parquet format using fastparquet and snappy compression.

Parameters:	dataframe – a pandas dataframe as input

helper.convert_to_parquet.call_conversion(in_name, out_name, mode)[source]¶

Create a Conversion object and call its execute() method. This wrapper is required because multiprocessing must pickle whatever it hands to worker processes: it can pickle a function, but it cannot pickle a class.

Parameters:
in_name – input flat file name (TSV or CSV)
out_name – output file name (Parquet)
mode – delimiter: comma or tab

helper.convert_to_parquet.command_line()[source]¶

Collect and validate command line arguments.
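The arguments that command_line() accepts are not listed in this page, so the following is only a sketch of how collection and validation could be done with argparse. The function name `parse_arguments`, the positional argument names, and the `--mode` choices are all assumptions.

```python
import argparse
import os


def parse_arguments(argv=None):
    """Parse and validate arguments; the argument names are hypothetical."""
    parser = argparse.ArgumentParser(description="Convert flat files to Parquet.")
    parser.add_argument("input_directory", help="directory of flat files")
    parser.add_argument("output_directory", help="directory to write Parquet files to")
    parser.add_argument("--mode", choices=["csv", "tsv"], default="csv",
                        help="delimiter used by every file in the input directory")
    args = parser.parse_args(argv)
    # Validate up front, before any worker process is started.
    if not os.path.isdir(args.input_directory):
        parser.error("input directory does not exist: " + args.input_directory)
    return args
```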

helper.convert_to_parquet.makedir_if_not_exist(directory)[source]¶

Create a directory if it does not already exist.

Parameters:	directory – (str) Directory name.
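One way to get this behaviour in Python 3 is `os.makedirs` with `exist_ok=True`, sketched below under the hypothetical name `ensure_directory`. Unlike a separate existence check followed by a create call, this form does not fail if another process creates the directory in between.

```python
import os


def ensure_directory(directory):
    """Create `directory` (and any missing parents) unless it already exists."""
    os.makedirs(directory, exist_ok=True)
```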

helper.convert_to_parquet.parallelizer(input_directory, output_directory, mode)[source]¶

Parallelizes this program to convert a whole directory of files.

Parameters:
input_directory – input directory of flat files
output_directory – name of the directory to write to
mode – delimiter used in the files; currently it must be the same for the whole directory
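The directory-level fan-out can be sketched with `multiprocessing.Pool`, using a module-level function as the unit of work so that it can be pickled and sent to the workers. Everything named here is hypothetical: `convert_one` stands in for call_conversion and does no real conversion, and the `.parquet` output naming in `build_jobs` is an assumption about how output files are named.

```python
import os
from multiprocessing import Pool


def convert_one(in_name, out_name, mode):
    """Hypothetical stand-in for call_conversion, defined at module level."""
    # A real implementation would build a Conversion object and call execute().
    return (in_name, out_name, mode)


def build_jobs(input_directory, output_directory, mode):
    """List one (input, output, mode) job per regular file in input_directory."""
    jobs = []
    for name in sorted(os.listdir(input_directory)):
        in_path = os.path.join(input_directory, name)
        if not os.path.isfile(in_path):
            continue  # skip subdirectories
        stem, _ = os.path.splitext(name)
        # Assumed naming scheme: same stem with a .parquet extension.
        out_path = os.path.join(output_directory, stem + ".parquet")
        jobs.append((in_path, out_path, mode))
    return jobs


def run_parallel(input_directory, output_directory, mode, processes=None):
    """Convert every file in a directory using a pool of worker processes."""
    jobs = build_jobs(input_directory, output_directory, mode)
    with Pool(processes=processes) as pool:
        # starmap unpacks each job tuple into convert_one's three arguments.
        return pool.starmap(convert_one, jobs)
```

When run as a script, `run_parallel` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.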
