spark_rma package

Submodules

spark_rma.median_polish module

Carry out median polish on a grouping of probes, returning one value per sample in that grouping.

class spark_rma.median_polish.ProbesetSummary(spark, input_data, num_samples, repartition_number)[source]

Bases: spark_rma.median_polish.Summary

Summarize probes with probeset region groupings.

class spark_rma.median_polish.Summary(spark, input_data, num_samples, repartition_number, **kwargs)[source]

Bases: object

Summarize gene expression values via median polish.

summarize()[source]

Summarize results across samples with median polish within defined groups.

udaf(data)[source]

Apply median polish within each groupBy key and return a value for each sample in that grouping.

This is a workaround user-defined aggregate function (UDAF) that passes the grouped data to Python to perform the median polish and returns the result to the dataframe.

Returns: Spark dataframe

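The data flow of this workaround can be mimicked outside Spark in pure Python: group the rows by key, collect each group's values, and apply an ordinary Python function per group. This is an illustrative sketch only; the `rows` data and the `summarize_group` stand-in (a plain median instead of the full median polish) are hypothetical, not the package's code.

```python
from itertools import groupby
from operator import itemgetter
from statistics import median

# Hypothetical rows of (group_key, sample, value), mimicking a Spark dataframe.
rows = [
    ("TC1", "s1", 1.0), ("TC1", "s2", 2.0),
    ("TC2", "s1", 3.0), ("TC2", "s2", 5.0),
]

def summarize_group(values):
    # Stand-in aggregate; the package applies median polish at this step.
    return median(values)

# Group by key, collect each group's values, apply the Python aggregate.
rows.sort(key=itemgetter(0))
result = {
    key: summarize_group([value for _, _, value in group])
    for key, group in groupby(rows, key=itemgetter(0))
}
```

In Spark the grouping and collection happen on executors, but the per-group aggregate is the same shape: a plain Python function over one group's values.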
class spark_rma.median_polish.TranscriptSummary(spark, input_data, num_samples, repartition_number)[source]

Bases: spark_rma.median_polish.Summary

Summarize probes with transcript cluster groupings

spark_rma.median_polish.command_line()[source]

Collect and validate command line arguments.

spark_rma.median_polish.infer_grouping_and_summarize(spark, input_file, output_file, num_samples, repartition_number)[source]

Read the input file to infer the grouping type and select the appropriate summarization class.

spark_rma.median_polish.main()[source]

Collect command-line arguments and start a Spark session when run with spark-submit.

spark_rma.median_polish.probe_summarization(grouped_values)[source]

Summarization step pickled by Spark as a UDF. Receives a grouping's data as a list, unpacks it, performs median polish, calculates the expression values from the median polish matrix results, packs them back up, and returns them to a new Spark dataframe.

Parameters: grouped_values – a list of strings, because Spark concatenated the values into one string for each sample. Each item has the format sample,probe,value, and all rows in the input belong to a single grouping key (transcript_cluster or probeset) that Spark has already handled.
Returns: a list of lists, where each item is a length-two (sample, value) pair.
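The median polish at the heart of this step can be sketched in NumPy. This is not the package's implementation; the matrix orientation (probes as rows, samples as columns) and the iteration cap are assumptions. The per-sample expression values are the overall effect plus each sample's column effect.

```python
import numpy as np

def median_polish(matrix, max_iter=10):
    """Tukey's median polish: alternately sweep out row and column medians.

    Rows are probes, columns are samples (an assumed orientation).
    Returns (overall, row_effects, col_effects, residuals).
    """
    resid = np.asarray(matrix, dtype=float).copy()
    overall = 0.0
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        rmed = np.median(resid, axis=1)
        resid -= rmed[:, None]
        row_eff += rmed
        delta = np.median(col_eff)
        col_eff -= delta
        overall += delta
        # Sweep column medians out of the residuals into the column effects.
        cmed = np.median(resid, axis=0)
        resid -= cmed
        col_eff += cmed
        delta = np.median(row_eff)
        row_eff -= delta
        overall += delta
        if np.allclose(rmed, 0) and np.allclose(cmed, 0):
            break
    return overall, row_eff, col_eff, resid

# Per-sample expression values for a grouping: overall + column effects.
overall, row_eff, col_eff, resid = median_polish([[1.0, 2.0], [3.0, 4.0]])
expression = overall + col_eff
```

For the tiny 2x2 example above the polish converges in one sweep, leaving zero residuals.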

spark_rma.quantile_normalization module

Quantile normalize background-corrected array samples.

spark_rma.quantile_normalization.command_line()[source]

Collect and validate command line arguments.

spark_rma.quantile_normalization.infer_target_level(data_frame)[source]

Read the input data frame to infer the target type (probeset or transcript_cluster).

spark_rma.quantile_normalization.main()[source]

Gather command-line arguments and create a Spark session if executed with spark-submit.

spark_rma.quantile_normalization.normalize(spark, input_path, output)[source]

Read parquet file, normalize, and write results.

Parameters:
  • spark – Spark session object
  • input_path – path to the input parquet file with background-corrected data
  • output – path to write results

spark_rma.quantile_normalization.quantile_normalize(data_frame, target)[source]

Quantile normalize the data using a Spark window spec, which allows ranking the values, averaging across samples at each rank, and reassigning the averaged values with a join.

Parameters:
  • data_frame – spark data frame
  • target – summary target defined by annotation. This is the level to which we will summarize, either probeset or transcript_cluster.
Returns: quantile normalized dataframe with sample, probe, target, and normalized value.
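The rank/average/reassign scheme described above can be illustrated outside Spark with NumPy. This is an illustrative sketch, not the package's window-spec implementation, and it ignores tie handling for simplicity.

```python
import numpy as np

def quantile_normalize(matrix):
    """Quantile normalize the columns (samples) of a probes-by-samples matrix.

    1. Rank the values within each sample.
    2. Average across samples at each rank.
    3. Reassign each value the mean for its rank.
    """
    m = np.asarray(matrix, dtype=float)
    order = np.argsort(m, axis=0)      # sort order within each sample
    ranks = np.argsort(order, axis=0)  # rank of each value in its sample
    rank_means = np.sort(m, axis=0).mean(axis=1)  # mean across samples per rank
    return rank_means[ranks]           # look up each value's rank mean

normalized = quantile_normalize([[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]])
```

After normalization every sample shares the same distribution of values (the rank means); only the order of values within each sample differs.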

Module contents