helper package

Submodules

helper.convert_to_parquet module

Converts files to Snappy-compressed Parquet without using Spark or the JVM. Converting up front means the files are already compressed before they are put into HDFS and/or read by Spark, which speeds up both of those steps.

class helper.convert_to_parquet.Conversion(input_name, output_name, delimiter)

Convert file to parquet.

check()

Check whether the output file has already been written. During recovery or a re-run, this avoids wasting time re-writing the same files.

Returns: bool, True if the output path exists
execute()

Check whether the output file exists and convert the input to Parquet only if it does not. This is the public method to call on the class.
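
The check-then-convert flow described for check() and execute() could be sketched roughly as follows. This is an illustration only, not the module's actual code: the class name ConversionSketch and the injected convert callable are hypothetical, and the real class reads and writes the file itself.

```python
import os


class ConversionSketch:
    """Hypothetical stand-in for helper.convert_to_parquet.Conversion."""

    def __init__(self, input_name, output_name, convert):
        self.input_name = input_name
        self.output_name = output_name
        # `convert` is an assumed callable(input_name, output_name) that does
        # the actual read and write; it is injected to keep the sketch small.
        self.convert = convert

    def check(self):
        """Return True if the output path already exists."""
        return os.path.exists(self.output_name)

    def execute(self):
        """Convert only when the output is missing; return True if work ran."""
        if self.check():
            return False  # already written, e.g. by an earlier run
        self.convert(self.input_name, self.output_name)
        return True
```

Returning a flag from execute() is a choice made for this sketch; it makes it easy to count how many files a re-run actually had to convert.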

read()

Read the file into a pandas DataFrame, as required by fastparquet. The file is read with the delimiter given explicitly rather than detected: an explicit delimiter lets pandas use its C parsing engine, which is faster than the Python engine needed for delimiter detection.

Returns: pandas.DataFrame
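
As a rough illustration of reading with an explicit delimiter, the sketch below calls pandas.read_csv with sep set and the C engine requested. The function name read_delimited is hypothetical, and the real read() may pass additional options.

```python
import io

import pandas as pd


def read_delimited(source, delimiter):
    """Read a delimited file into a DataFrame with an explicit separator.

    Passing `sep` explicitly (instead of sep=None, which asks pandas to
    detect the delimiter) keeps pandas on its C parser.
    """
    return pd.read_csv(source, sep=delimiter, engine="c")


# Usage with in-memory data; a file path works the same way.
csv_frame = read_delimited(io.StringIO("id,name\n1,a\n2,b\n"), ",")
tsv_frame = read_delimited(io.StringIO("id\tname\n1\ta\n2\tb\n"), "\t")
```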
write(dataframe)

Write a pandas DataFrame to Parquet format using fastparquet with Snappy compression.

Parameters: dataframe – the pandas DataFrame to write
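
A write step of this kind might look like the sketch below, which uses fastparquet.write with compression="SNAPPY". The function name and the output_name argument are assumptions for illustration; in the real class the output path comes from the constructor.

```python
def write_snappy_parquet(dataframe, output_name):
    """Write a DataFrame to a Snappy-compressed Parquet file."""
    # Imported here so the dependency is only needed when a file is written.
    import fastparquet

    fastparquet.write(output_name, dataframe, compression="SNAPPY")
```

This requires a fastparquet installation with a Snappy codec available.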
helper.convert_to_parquet.call_conversion(in_name, out_name, mode)

Create a Conversion object and call its execute() method. This module-level function is needed because multiprocessing must pickle whatever it dispatches to workers, and it can pickle a function but not a class.

Parameters:
  • in_name – input file name; a flat file in TSV or CSV format
  • out_name – output file name in parquet
  • mode – delimiter, either comma or tab
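
The wrapper pattern described above can be sketched as follows. ConversionStub is a hypothetical stand-in for the real Conversion class, and passing mode straight through as the delimiter is an assumption of this sketch.

```python
class ConversionStub:
    """Hypothetical stand-in for helper.convert_to_parquet.Conversion."""

    def __init__(self, input_name, output_name, delimiter):
        self.input_name = input_name
        self.output_name = output_name
        self.delimiter = delimiter

    def execute(self):
        # The real class would read the input and write Parquet here.
        return (self.input_name, self.output_name, self.delimiter)


def call_conversion_sketch(in_name, out_name, mode):
    """Module-level wrapper: build the object, then run its public method.

    Defined at module top level so that worker processes can look it up by
    name when multiprocessing pickles the task.
    """
    return ConversionStub(in_name, out_name, mode).execute()


result = call_conversion_sketch("data.tsv", "data.parquet", "\t")
```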
helper.convert_to_parquet.command_line()

Collect and validate command line arguments.
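
The arguments this function accepts are not listed here, so the following argparse sketch is purely illustrative. Every argument name, the --mode choices, and the validation rule are assumptions, chosen to mirror the parameters of parallelizer().

```python
import argparse
import os


def command_line_sketch(argv=None):
    """Parse and validate arguments; all argument names are assumptions."""
    parser = argparse.ArgumentParser(
        description="Convert a directory of flat files to Parquet."
    )
    parser.add_argument("input_directory", help="directory of flat files")
    parser.add_argument("output_directory", help="directory to write to")
    parser.add_argument(
        "--mode",
        choices=["comma", "tab"],
        default="comma",
        help="delimiter used by every file in the directory",
    )
    args = parser.parse_args(argv)

    # Validation: fail early with a usage message if the input is missing.
    if not os.path.isdir(args.input_directory):
        parser.error("input directory not found: " + args.input_directory)
    return args
```

Accepting argv as a parameter keeps the function testable; with argv=None, argparse falls back to sys.argv.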

helper.convert_to_parquet.makedir_if_not_exist(directory)

Create a directory if it does not already exist.

Parameters: directory – (str) directory name
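
One way to implement this behavior is sketched below using os.makedirs. The function name is hypothetical, and creating missing parent directories is an assumption of this sketch rather than documented behavior of the real function.

```python
import os


def makedir_if_not_exist_sketch(directory):
    """Create `directory` (and any missing parents) if it is not there yet."""
    # exist_ok=True makes the call a no-op when the directory already exists
    # and avoids a race between checking and creating.
    os.makedirs(directory, exist_ok=True)
```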
helper.convert_to_parquet.parallelizer(input_directory, output_directory, mode)

Parallelizes this program to convert a whole directory of files.

Parameters:
  • input_directory – input directory of flat files
  • output_directory – name of directory to write to
  • mode – delimiter used in the files; currently it must be the same for the whole directory
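
A directory-wide conversion of this kind could be organized as in the sketch below: build one job per input file, then hand the jobs to a multiprocessing.Pool. All names here are hypothetical, the .parquet naming scheme is an assumption, and convert_one is a placeholder for the real per-file conversion.

```python
import multiprocessing
import os


def build_jobs(input_directory, output_directory, mode):
    """Pair each input file with a .parquet output path and the shared mode."""
    jobs = []
    for name in sorted(os.listdir(input_directory)):
        in_name = os.path.join(input_directory, name)
        if not os.path.isfile(in_name):
            continue  # skip subdirectories
        stem = os.path.splitext(name)[0]
        out_name = os.path.join(output_directory, stem + ".parquet")
        jobs.append((in_name, out_name, mode))
    return jobs


def convert_one(in_name, out_name, mode):
    """Placeholder worker; the real module would call call_conversion here."""
    return out_name


def parallelizer_sketch(input_directory, output_directory, mode, processes=None):
    """Fan the per-file jobs out over a pool of worker processes."""
    jobs = build_jobs(input_directory, output_directory, mode)
    with multiprocessing.Pool(processes) as pool:
        # starmap unpacks each (in_name, out_name, mode) tuple into arguments.
        return pool.starmap(convert_one, jobs)
```

On platforms that start workers by spawning a fresh interpreter, parallelizer_sketch should be called from under an `if __name__ == "__main__":` guard.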

Module contents