helper package¶
Submodules¶
helper.convert_to_parquet module¶
Converts flat files to snappy-compressed Parquet without using Spark or the JVM, so that files are compressed before they are put into HDFS and/or before Spark reads them, accelerating both tasks.
class helper.convert_to_parquet.Conversion(input_name, output_name, delimiter)
Convert a file to Parquet.
check()
Check whether the output file has already been written; during recovery or a re-run, this avoids wasting time rewriting the same files.
Returns: bool – whether the output path exists
execute()
Check whether the output already exists and convert the file to Parquet if it does not. This is the public method to call on the class.
helper.convert_to_parquet.call_conversion(in_name, out_name, mode)
Create a Conversion object and call its method to execute the conversion. This wrapper is required because multiprocessing can pickle a function but cannot pickle a class.
Parameters:
- in_name – input file name (flat file, TSV, or CSV)
- out_name – output file name (Parquet)
- mode – delimiter, comma or tab
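The wrapper pattern might look like the following sketch; the `Conversion` stand-in here is assumed, with only its constructor and `execute()` mirroring the documented API:

```python
# Sketch of the module-level wrapper pattern. The Conversion stand-in is a
# stub; the real class's conversion logic is not shown in these docs.
class Conversion:
    def __init__(self, input_name, output_name, delimiter):
        self.input_name = input_name
        self.output_name = output_name
        self.delimiter = delimiter

    def execute(self):
        # The real class would check for existing output and then convert;
        # stubbed here so the wrapper shape is the focus.
        return (self.input_name, self.output_name, self.delimiter)

def call_conversion(in_name, out_name, mode):
    # A plain module-level function pickles by reference, so a
    # multiprocessing.Pool can ship it to worker processes; the object is
    # created inside each worker rather than pickled across.
    return Conversion(in_name, out_name, mode).execute()
```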
helper.convert_to_parquet.makedir_if_not_exist(directory)
Create a directory if it does not already exist.
Parameters: directory – (str) directory name
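A minimal sketch of this helper (the use of `os.makedirs` with `exist_ok=True` is an assumption about the implementation):

```python
import os

def makedir_if_not_exist(directory: str) -> None:
    # exist_ok=True makes the call idempotent: it creates the directory
    # (including parents) and is a no-op if it already exists.
    os.makedirs(directory, exist_ok=True)
```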
helper.convert_to_parquet.parallelizer(input_directory, output_directory, mode)
Parallelize this program to convert a whole directory of files.
Parameters:
- input_directory – input directory of flat files
- output_directory – name of the directory to write to
- mode – delimiter used in the files; currently must be the same for the whole directory
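A directory-level driver with this signature might look like the following sketch. The `MODES` mapping, the output naming scheme, and the `fork` start method are assumptions, and `convert_one` stands in for `call_conversion`:

```python
import os
from multiprocessing import get_context

MODES = {"comma": ",", "tab": "\t"}  # assumed mapping for `mode`

def convert_one(job):
    # Placeholder for call_conversion(in_path, out_path, delimiter);
    # returns the output path so the driver can report what was written.
    in_path, out_path, delimiter = job
    return out_path

def parallelizer(input_directory, output_directory, mode):
    # Ensure the target directory exists (mirrors makedir_if_not_exist).
    os.makedirs(output_directory, exist_ok=True)
    delimiter = MODES[mode]
    jobs = [
        (
            os.path.join(input_directory, name),
            os.path.join(output_directory, os.path.splitext(name)[0] + ".parquet"),
            delimiter,
        )
        for name in sorted(os.listdir(input_directory))
    ]
    # Fan the jobs out across worker processes; "fork" start method assumed (Linux).
    with get_context("fork").Pool() as pool:
        return pool.map(convert_one, jobs)
```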