spark_rma package

Submodules

spark_rma.median_polish module

Carry out median polish on a grouping of probes, returning one value per sample in that grouping.

class spark_rma.median_polish.ProbesetSummary(spark, input_data, num_samples, repartition_number)[source]

Bases: spark_rma.median_polish.Summary

Summarize probes with probeset region groupings.

class spark_rma.median_polish.Summary(spark, input_data, num_samples, repartition_number, **kwargs)[source]

Bases: object

Summarize gene expression values via median polish.

summarize()[source]

Summarize results across samples with median polish within defined groups.

udaf(data)[source]

Apply median polish within each groupBy key and return a value for each sample in that grouping.

This is a workaround user-defined aggregate function (UDAF) that passes the grouped data to Python to perform the median polish and returns the result to the dataframe.

Returns: Spark dataframe

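The data flow of this workaround can be mimicked outside Spark in pure Python: group the rows by key, collect each group's values, and apply an ordinary Python function per group. This is an illustrative sketch only; the `rows` data and the `summarize_group` stand-in (a plain median instead of the full median polish) are hypothetical, not the package's code.

```python
from itertools import groupby
from operator import itemgetter
from statistics import median

# Hypothetical rows of (group_key, sample, value), mimicking a Spark dataframe.
rows = [
    ("TC1", "s1", 1.0), ("TC1", "s2", 2.0),
    ("TC2", "s1", 3.0), ("TC2", "s2", 5.0),
]

def summarize_group(values):
    # Stand-in aggregate; the package applies median polish at this step.
    return median(values)

# Group by key, collect each group's values, apply the Python aggregate.
rows.sort(key=itemgetter(0))
result = {
    key: summarize_group([value for _, _, value in group])
    for key, group in groupby(rows, key=itemgetter(0))
}
```

In Spark the grouping and collection happen on executors, but the per-group aggregate is the same shape: a plain Python function over one group's values.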
class spark_rma.median_polish.TranscriptSummary(spark, input_data, num_samples, repartition_number)[source]

Bases: spark_rma.median_polish.Summary

Summarize probes with transcript cluster groupings

spark_rma.median_polish.command_line()[source]

Collect and validate command line arguments.

spark_rma.median_polish.infer_grouping_and_summarize(spark, input_file, output_file, num_samples, repartition_number)[source]

Read the input file to infer the grouping type and select the appropriate summarization class.

spark_rma.median_polish.main()[source]

Collect command-line arguments and start a Spark session when run with spark-submit.

spark_rma.median_polish.probe_summarization(grouped_values)[source]

Summarization step pickled by Spark as a UDF. Receives a grouping's data as a list, unpacks it, performs median polish, calculates the expression values from the median polish matrix results, packs them back up, and returns them to a new Spark dataframe.

Parameters: grouped_values – a list of strings, because Spark concatenated the values into one string for each sample. Each item has the format sample,probe,value, and all rows in the input belong to a single grouping key (transcript_cluster or probeset) that Spark has already handled.
Returns: a list of lists, where each item is a length-two (sample, value) pair.
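The median polish at the heart of this step can be sketched in NumPy. This is not the package's implementation; the matrix orientation (probes as rows, samples as columns) and the iteration cap are assumptions. The per-sample expression values are the overall effect plus each sample's column effect.

```python
import numpy as np

def median_polish(matrix, max_iter=10):
    """Tukey's median polish: alternately sweep out row and column medians.

    Rows are probes, columns are samples (an assumed orientation).
    Returns (overall, row_effects, col_effects, residuals).
    """
    resid = np.asarray(matrix, dtype=float).copy()
    overall = 0.0
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        rmed = np.median(resid, axis=1)
        resid -= rmed[:, None]
        row_eff += rmed
        delta = np.median(col_eff)
        col_eff -= delta
        overall += delta
        # Sweep column medians out of the residuals into the column effects.
        cmed = np.median(resid, axis=0)
        resid -= cmed
        col_eff += cmed
        delta = np.median(row_eff)
        row_eff -= delta
        overall += delta
        if np.allclose(rmed, 0) and np.allclose(cmed, 0):
            break
    return overall, row_eff, col_eff, resid

# Per-sample expression values for a grouping: overall + column effects.
overall, row_eff, col_eff, resid = median_polish([[1.0, 2.0], [3.0, 4.0]])
expression = overall + col_eff
```

For the tiny 2x2 example above the polish converges in one sweep, leaving zero residuals.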

spark_rma.quantile_normalization module

Quantile normalize background-corrected array samples.

spark_rma.quantile_normalization.command_line()[source]

Collect and validate command line arguments.

spark_rma.quantile_normalization.infer_target_level(data_frame)[source]

Read the input data frame to infer the target type (probeset or transcript_cluster).

spark_rma.quantile_normalization.main()[source]

Gather command-line arguments and create a Spark session if executed with spark-submit.

spark_rma.quantile_normalization.normalize(spark, input_path, output)[source]

Read parquet file, normalize, and write results.

Parameters:
  • spark – Spark session object
  • input_path – path to the input parquet file with background-corrected data
  • output – path to write results

spark_rma.quantile_normalization.quantile_normalize(data_frame, target)[source]

Quantile normalize the data using a Spark window spec, which allows ranking the values, averaging across samples at each rank, and reassigning the averaged values with a join.

Parameters:
  • data_frame – spark data frame
  • target – summary target defined by annotation. This is the level to which we will summarize, either probeset or transcript_cluster.
Returns: quantile normalized dataframe with sample, probe, target, and normalized value.
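The rank/average/reassign scheme described above can be illustrated outside Spark with NumPy. This is an illustrative sketch, not the package's window-spec implementation, and it ignores tie handling for simplicity.

```python
import numpy as np

def quantile_normalize(matrix):
    """Quantile normalize the columns (samples) of a probes-by-samples matrix.

    1. Rank the values within each sample.
    2. Average across samples at each rank.
    3. Reassign each value the mean for its rank.
    """
    m = np.asarray(matrix, dtype=float)
    order = np.argsort(m, axis=0)      # sort order within each sample
    ranks = np.argsort(order, axis=0)  # rank of each value in its sample
    rank_means = np.sort(m, axis=0).mean(axis=1)  # mean across samples per rank
    return rank_means[ranks]           # look up each value's rank mean

normalized = quantile_normalize([[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]])
```

After normalization every sample shares the same distribution of values (the rank means); only the order of values within each sample differs.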

Module contents