spark_rma package¶
Submodules¶
spark_rma.median_polish module¶
Carry out median polish on a grouping of probes, returning a value for each sample in that grouping.
class spark_rma.median_polish.ProbesetSummary(spark, input_data, num_samples, repartition_number)[source]¶
Bases: spark_rma.median_polish.Summary
Summarize probes with probeset region grouping.
class spark_rma.median_polish.Summary(spark, input_data, num_samples, repartition_number, **kwargs)[source]¶
Bases: object
Summarize gene expression values via median polish.
udaf(data)[source]¶
Apply median polish to groupBy keys and return a value for each sample within that grouping.
This is a workaround user-defined aggregate function (UDAF): it passes the grouped data to Python, performs median polish there, and returns the result back to the DataFrame.
Returns: Spark DataFrame
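The summarization each grouping undergoes is Tukey's median polish. As an illustration of the algorithm itself (a plain-Python sketch with assumed iteration and tolerance defaults, not the package's implementation), a probes × samples matrix is decomposed into overall, row, and column effects, and the per-sample expression value is the overall effect plus that sample's column effect:

```python
from statistics import median

def median_polish(matrix, max_iter=10, tol=0.01):
    """Tukey's median polish on a probes x samples matrix: alternately
    sweep row and column medians into row/column effects until the
    residuals stabilize. Returns overall + column effect per sample."""
    data = [row[:] for row in matrix]
    n_rows, n_cols = len(data), len(data[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep each row's median into its row effect.
        for i in range(n_rows):
            m = median(data[i])
            row_eff[i] += m
            data[i] = [v - m for v in data[i]]
        # Move the median of the row effects into the overall effect.
        delta = median(row_eff)
        overall += delta
        row_eff = [r - delta for r in row_eff]
        # Sweep each column's median into its column effect.
        for j in range(n_cols):
            m = median(data[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                data[i][j] -= m
        delta = median(col_eff)
        overall += delta
        col_eff = [c - delta for c in col_eff]
        # Stop once the residuals are (close to) exhausted.
        if sum(abs(v) for row in data for v in row) < tol:
            break
    return [overall + c for c in col_eff]

# 3 probes x 2 samples, where sample 2 reads 1.0 higher throughout:
values = median_polish([[5.0, 6.0], [7.0, 8.0], [6.0, 7.0]])
```

For this input the per-sample values come out to 6.0 and 7.0: the probe (row) differences are absorbed into row effects, leaving only the consistent between-sample shift.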
class spark_rma.median_polish.TranscriptSummary(spark, input_data, num_samples, repartition_number)[source]¶
Bases: spark_rma.median_polish.Summary
Summarize probes with transcript cluster groupings.
spark_rma.median_polish.infer_grouping_and_summarize(spark, input_file, output_file, num_samples, repartition_number)[source]¶
Read the input file to infer the grouping type and select the appropriate summarization class.
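The dispatch this function performs can be sketched as a lookup on the columns present in the input; the column names and the shape of the helper below are assumptions for illustration, not the package's actual inference logic:

```python
def infer_grouping(columns):
    """Pick the summarization grouping from the column names present.
    The column names checked here are illustrative assumptions."""
    if "probeset" in columns:
        return "probeset"          # would dispatch to ProbesetSummary
    if "transcript_cluster" in columns:
        return "transcript_cluster"  # would dispatch to TranscriptSummary
    raise ValueError("no recognized grouping column in input")
```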
spark_rma.median_polish.main()[source]¶
Collect command-line arguments and start a Spark session when using spark-submit.
spark_rma.median_polish.probe_summarization(grouped_values)[source]¶
Summarization step to be pickled by Spark as a UDF. Receives a grouping's data as a list, unpacks it, performs median polish, calculates the expression values from the median polish matrix results, packs the results back up, and returns them to a new Spark DataFrame.
Parameters: grouped_values – a list of strings, because Spark concatenates all the values into one string for each sample. Each item has the format sample,probe,value, and all rows in the input belong to a grouping key handled by Spark (transcript_cluster or probeset).
Returns: a list of lists, where each item has length two: (sample, value)
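The unpack step described above can be sketched as follows; this is an illustration of parsing the sample,probe,value strings into a matrix-like structure (the helper name is hypothetical, and the real UDF also runs median polish before repacking):

```python
def unpack_grouped_values(grouped_values):
    """Split each 'sample,probe,value' string into typed fields and
    pivot them into a {probe: {sample: value}} nested dict, i.e. one
    row per probe and one column per sample for the grouping key."""
    matrix = {}
    for row in grouped_values:
        sample, probe, value = row.split(",")
        matrix.setdefault(probe, {})[sample] = float(value)
    return matrix

rows = ["s1,p1,5.0", "s2,p1,6.0", "s1,p2,7.0", "s2,p2,8.0"]
matrix = unpack_grouped_values(rows)
```

Here `matrix["p1"]` holds probe p1's value for each sample; after median polish, the result would be repacked as `[[sample, value], ...]` pairs to match the documented return shape.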
spark_rma.quantile_normalization module¶
Quantile normalize background-corrected array samples.
spark_rma.quantile_normalization.command_line()[source]¶
Collect and validate command-line arguments.
spark_rma.quantile_normalization.infer_target_level(data_frame)[source]¶
Read the input data to infer the target type and select the appropriate summarization class.
spark_rma.quantile_normalization.main()[source]¶
Gather command-line arguments and create a Spark session if executed with spark-submit.
spark_rma.quantile_normalization.normalize(spark, input_path, output)[source]¶
Read a parquet file, normalize, and write the results.
Parameters:
- spark – Spark session object
- input_path – path to the input parquet file with background-corrected data
- output – path to write results
spark_rma.quantile_normalization.quantile_normalize(data_frame, target)[source]¶
Quantile normalize the data using a Spark window spec, which allows ranking the values, averaging across samples at each rank, and reassigning the averaged values with a join.
Parameters:
- data_frame – Spark DataFrame
- target – summary target defined by the annotation. This is the level to which we will summarize, either probeset or transcript_cluster.
Returns: quantile-normalized DataFrame with sample, probe, target, and normalized value.
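The rank/average/reassign idea behind the window-spec approach can be illustrated without Spark. This is a pure-Python sketch of quantile normalization across samples (the function name and input shape are assumptions, not the package's API; ties are ignored for simplicity):

```python
def quantile_normalize_columns(columns):
    """columns: {sample: [values]}, all samples the same length.
    Rank each sample's values independently, average the values at
    each rank across samples, then give every original entry the mean
    for its rank -- the same rank/average/join idea as the Spark
    window-spec version."""
    n = len(next(iter(columns.values())))
    # Sort each sample independently, then take the cross-sample mean
    # at every rank (the "average" step).
    sorted_cols = {s: sorted(v) for s, v in columns.items()}
    rank_means = [
        sum(sorted_cols[s][i] for s in columns) / len(columns)
        for i in range(n)
    ]
    # Reassign: each original value is replaced by the mean for its
    # rank within its own sample (the "join" step).
    result = {}
    for sample, vals in columns.items():
        order = sorted(range(n), key=lambda i: vals[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result[sample] = out
    return result

normalized = quantile_normalize_columns(
    {"s1": [2.0, 4.0, 6.0], "s2": [3.0, 5.0, 7.0]}
)
```

After normalization both samples share the same distribution (here [2.5, 4.5, 6.5]), while each value keeps its rank within its own sample.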