PyParse Function Detail

Copyright 2023 GlaxoSmithKline Research & Development Limited

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Authors: Joe Mason, Francesco Rianjongdee, Harry Wilders, David Fallon

PyParse.buildHTML(save_dir, compoundDF, all_compounds, impurities, analysis_name, times={})

Build an HTML output file using jinja2 and an html_template stored in the “templates” directory.

Parameters:
  • save_dir – A string designating the output directory

  • compoundDF – pandas dataframe containing all information on the compounds used for analysis

  • all_compounds – a list of all compound names

  • impurities – a list of all impurity names

  • analysis_name – User provided name for the analysis

  • times – Optional. A list of floats giving the processing time for each step of the analysis.

Returns:

HTML file saved to save_dir

PyParse.findHits(compound, dataTable)

For a given compound, look at each peak in each well to find a suitable match based on m/z data.

Parameters:
  • compound – A Series corresponding to a specific compound from the compoundDF pandas dataframe

  • dataTable – A dictionary of all peaks in all wells for the plate

Returns:

a list of hits, where each item in the list is a dictionary
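
As a rough illustration of m/z-based matching (not PyParse's actual implementation), the sketch below compares the masses observed for each peak against the expected [M+H]+ and [M-H]- values. The helper name find_hits_sketch, the peak keys "time", "ms_plus" and "ms_minus", and the tolerance value are all assumptions made for this example.

```python
def find_hits_sketch(isotopic_mass, data_table, tolerance=0.5):
    """Return one dict per peak whose MS+ or MS- data matches the compound."""
    expected_plus = isotopic_mass + 1   # nominal [M+H]+
    expected_minus = isotopic_mass - 1  # nominal [M-H]-
    hits = []
    for well, peaks in data_table.items():
        for peak in peaks:
            plus_match = any(
                abs(mz - expected_plus) <= tolerance for mz in peak["ms_plus"]
            )
            minus_match = any(
                abs(mz - expected_minus) <= tolerance for mz in peak["ms_minus"]
            )
            if plus_match or minus_match:
                hits.append({"well": well, "time": peak["time"]})
    return hits


# Hypothetical data: a dictionary of peaks indexed by well number.
data_table = {
    1: [{"time": 0.85, "ms_plus": [181.1, 203.0], "ms_minus": [179.0]}],
    2: [{"time": 1.20, "ms_plus": [250.2], "ms_minus": [248.1]}],
}
hits = find_hits_sketch(180.0, data_table)
```

Only the peak in well 1 matches here, because its MS+ data contains a mass within the tolerance of 181.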

PyParse.findImpurities(dataTable, compoundDF, save_dir, chroma)

Finds impurities that the program was not explicitly asked to find, by searching for commonly appearing peaks that have a clear ionisation pattern and have not already been assigned.

Parameters:
  • dataTable – dictionary of all peaks, indexed by well

  • compoundDF – pandas dataframe for all compounds

  • save_dir – string for output directory

  • chroma – a dictionary containing chromatograms, indexed by well

Returns:

Chromatograms for each impurity, additional rows in the compoundDF, and hit validation graph plotted containing all impurity hits.

PyParse.findOverlap(dataTable, well, time)

Given the dataTable and the well and retention time of a peak of interest, this function looks for any overlapping peaks and returns the retention times of those peaks to the user.

Parameters:
  • dataTable – A dictionary of lists of dicts, indexed by well

  • well – An integer to describe the well number

  • time – The retention time, as a float, of the peak of interest

Returns:

A list of overlaps
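
One way such an overlap check could work is sketched below, assuming each peak dict carries its retention time and the start and end of its integration window. The key names "time", "pStart" and "pEnd", and the helper name find_overlap_sketch, are hypothetical; the real structure of dataTable may differ.

```python
def find_overlap_sketch(data_table, well, time):
    """Return retention times of peaks in `well` that overlap the peak at `time`."""
    peaks = data_table[well]
    # Pick the peak closest to the requested time, avoiding exact float equality.
    target = min(peaks, key=lambda p: abs(p["time"] - time))
    overlaps = []
    for peak in peaks:
        if peak is target:
            continue
        # Two intervals overlap when each one starts before the other ends.
        if peak["pStart"] < target["pEnd"] and peak["pEnd"] > target["pStart"]:
            overlaps.append(peak["time"])
    return overlaps


# Hypothetical data for well 5: the first two peaks share part of their window.
data_table = {
    5: [
        {"time": 0.80, "pStart": 0.75, "pEnd": 0.86},
        {"time": 0.88, "pStart": 0.84, "pEnd": 0.93},
        {"time": 1.50, "pStart": 1.45, "pEnd": 1.56},
    ]
}
overlaps = find_overlap_sketch(data_table, 5, 0.80)
```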

PyParse.findPotentialConflicts(compoundDF)

Find any products which are in danger of overlapping with other compounds that are expected in the same well. This function is useful because the program may not detect a starting material (or other compound) for that peak and well when the mass_conf is too low, yet the user should still be warned that there may be a problem.

Parameters:

compoundDF – pandas dataframe for compounds

Returns:

A text output for that compound.

PyParse.genLocationHeatmaps(cptable, save_dir)

Generate a heatmap-style graph to visualise the expected locations of each compound, based on the platemap that was provided.

Parameters:
  • cptable – Pandas dataframe containing compounds

  • save_dir – string describing the location of the output folder

Output:

Heatmaps saved to the output folder.

PyParse.generateMol(smiles, name, save_dir)

Generates a 2D rendering of the given structure and saves it as a .png

Parameters:
  • smiles – a string (SMILES) of the compound

  • name – a string corresponding to name of that compound

  • save_dir – a string corresponding to the output directory

Returns:

A .png rendering of the given compound

PyParse.generateOutputTable(compoundDF, internalSTD, SMs, products, by_products, total_area_abs)

Reformats the validated hits into a pandas table ready for visualisation and export.

Parameters:
  • compoundDF – The pandas dataframe containing all compounds with their respective hits.

  • internalSTD – the name of the internalSTD

  • SMs – a list of indices for the starting materials

  • products – a list of indices for the products

  • by_products – a list of indices for the by-products

  • total_area_abs – A float corresponding to the sum of all peak_area_absolutes

Returns:

A Pandas table named outputTable

PyParse.getUserReadableWell(wellno)

Converts a well number into a user-friendly string, e.g. well 11 becomes “B5” for a 4 × 6 (24-well) plate.

Parameters:

wellno – An integer representing a specific well on the plate

Returns:

A string representing a specific well on the plate
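
The conversion can be sketched as follows, assuming wells are numbered from 1, row by row, and that the number of columns is known. The helper name well_label and its n_columns parameter are hypothetical; the documented function takes only wellno.

```python
import string


def well_label(wellno, n_columns):
    """Convert a 1-based, row-wise well number into a label such as 'B5'."""
    # Shift to 0-based so divmod yields (row index, column index).
    row, col = divmod(wellno - 1, n_columns)
    return f"{string.ascii_uppercase[row]}{col + 1}"


# A plate with 6 columns per row: well 11 is the 5th well of the 2nd row.
label = well_label(11, 6)
```

With 6 columns this reproduces the example above, where well 11 maps to B5.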

PyParse.importStructures(filename, save_dir)

From a given CSV, with defined header names and one row per well, deconvolute the information to give a dataframe of one compound per row, indexed by canonical SMILES. The well must be specified using a capital letter from A-Z to describe the row, and a positive integer to describe the column. Column numbers should be written as 1, 2, etc., as opposed to 01, 02, etc.

Parameters:

filename – file name and directory as a string

Returns:

List comprising [Pandas dataframe, name of internal STD, list of names of the starting materials, list of names of the products, list of names of the byproducts]
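
The well format described above can be checked with a short regular expression. This is a minimal sketch, not the parser PyParse uses: the helper name parse_well is hypothetical, and rejecting zero-padded columns outright is a choice made for this example.

```python
import re

# One capital letter for the row, then a positive integer with no leading zero.
WELL_PATTERN = re.compile(r"([A-Z])([1-9][0-9]*)")


def parse_well(label):
    """Split a well label such as 'B5' into its row letter and column number."""
    match = WELL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"Unrecognised well label: {label!r}")
    return match.group(1), int(match.group(2))


row, column = parse_well("B5")
```

Labels such as "A01" or "b5" raise a ValueError in this sketch.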

PyParse.main()

Provide an .rpt file and a csv containing the compounds in the plate to analyse the wells and return an output containing plots of compounds, key wells, and multiple output tables.

PyParse.plotChroma(cpname, wellno, trace, pStart, pEnd, annotate_peaks, save_dir, ms_plus, ms_minus, mass1)

Plots the LCMS trace with labels for compounds found for a specific well, and highlights a particular peak of interest, providing m/z data for that peak.

Parameters:
  • cpname – a string of the compound name

  • wellno – an integer representing the well

  • trace – a list [x-values, y-values] to plot the UV chromatogram

  • pStart – a float for the time where a specific peak begins

  • pEnd – a float for the time where a specific peak ends.

  • annotate_peaks – a list of dictionaries for peaks to annotate

  • save_dir – a string for the output directory

  • ms_plus – a list [x-values, y-values] for MS+ spectrometric data

  • ms_minus – a list [x-values, y-values] for MS- spectrometric data

  • mass1 – the isotopic mass of a compound, to which +1/-1 should be added to get to an expected observed mass (typically parent isotopic mass)

Returns:

jpg of the chromatogram saved to output directory

PyParse.plotDonut(dataframe, save_dir)

Takes the outputTable dataframe and generates a donut chart for product %area to be saved in the output folder.

Parameters:
  • dataframe – A Pandas dataframe (AKA outputTable)

  • save_dir – a string for the output directory

Returns:

Saved jpg of the donut chart

PyParse.plotHeatmaps(outputTable, save_dir)

Plots and saves heatmaps for the full dataset.

Parameters:
  • outputTable – a pandas dataframe

  • save_dir – a string for the output directory

Returns:

jpg of the heatmap saved to output directory

PyParse.plotHistogram(dataframe, save_dir)

Takes the outputTable dataframe and generates a histogram for product %area to be saved in the output folder.

Parameters:
  • dataframe – A Pandas dataframe (AKA outputTable)

  • save_dir – a string for the output directory

Returns:

jpg of the histogram saved to output directory

PyParse.plotHitValidationGraph(cpname, validatedHits, save_dir, cluster_bands)

Plots all the hit peaks in a scatter graph of peaktime vs well, colour coded by whether the hit was included or discarded from the final output.

Parameters:
  • cpname – a string for the name of the compound of interest

  • validatedHits – a dict, where each header contains a list of dicts

  • save_dir – File directory for where to save the matplotlib figure

  • cluster_bands – A list of the average retention times for each cluster.

Returns:

jpg of the hit validation graph saved to output directory

PyParse.plotPieCharts(zvalue, outputTable, save_dir, by_products)

Plots a set of pie charts for the full plate using the full dataset. The size of each pie chart is dependent on the value in the outputTable column specified by zvalue.

Parameters:
  • zvalue – a string corresponding to the desired output metric (e.g. P/STD)

  • outputTable – a pandas dataframe

  • save_dir – a string for the output directory

  • by_products – a list of names of byproducts

Returns:

jpg of the piecharts saved to output directory

PyParse.refineClusterByMassConf(cluster, comments)

Takes an input cluster of all the hit peaks, and refines it by ensuring all peaks have a mass confidence similar to the cluster’s mean. Those which do are left in “green”; those which don’t are moved to the “orange” category.

Parameters:
  • cluster – a dict, with list of dicts for each header

  • comments – A list of comments for the compound so far

Returns:

List comprising [a dictionary for the refined cluster, list of comments]
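
The green/orange split can be illustrated with the sketch below. It assumes the cluster is a dict with "green" and "orange" lists, that each hit stores its confidence under a "mass_conf" key, and that "similar" means within a fixed deviation of the mean; the key names, the max_deviation threshold and the helper name are assumptions, not PyParse's actual criteria.

```python
from statistics import mean


def refine_by_mass_conf_sketch(cluster, comments, max_deviation=0.2):
    """Move green hits whose mass_conf is far from the green mean into orange."""
    hits = cluster["green"]
    if not hits:
        return [cluster, comments]
    mean_conf = mean(hit["mass_conf"] for hit in hits)
    kept, moved = [], []
    for hit in hits:
        if abs(hit["mass_conf"] - mean_conf) <= max_deviation:
            kept.append(hit)
        else:
            moved.append(hit)
    # Build new containers rather than mutating the inputs.
    refined = dict(cluster, green=kept, orange=cluster["orange"] + moved)
    if moved:
        comments = comments + [f"{len(moved)} hit(s) moved to orange by mass_conf"]
    return [refined, comments]


# Hypothetical cluster: three consistent hits and one with a low confidence.
cluster = {
    "green": [
        {"well": 1, "mass_conf": 0.9},
        {"well": 2, "mass_conf": 0.9},
        {"well": 3, "mass_conf": 0.9},
        {"well": 4, "mass_conf": 0.3},
    ],
    "orange": [],
}
refined, comments = refine_by_mass_conf_sketch(cluster, [])
```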

PyParse.refineClusterByTime(cluster, comments, expected_rt)

Takes an input cluster of all the hit peaks, and refines it by finding a mid-value for the retention time based on which hit has the greatest number of nearest neighbours. Sorts the best hits into “green”, uncertain ones into “orange”, and those where another peak closer to the mid-value was found in the same well into “discarded”.

Parameters:
  • cluster – list of dictionaries, where each dictionary is a hit

  • comments – A list of comments for that structure so far.

Returns:

List comprising [a dictionary for the refined cluster, list of comments]
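
The idea of picking a mid-value from the most "crowded" hit and then sorting hits around it can be sketched as follows. The window width, the hit keys "well" and "time", and the helper name are assumptions for illustration; the sketch also ignores comments and expected_rt, which the real function accepts.

```python
def refine_by_time_sketch(hits, window=0.05):
    """Sort hits into green/orange/discarded around the most crowded retention time."""

    def neighbours(hit):
        # Count hits (including this one) within `window` of this hit's time.
        return sum(abs(other["time"] - hit["time"]) <= window for other in hits)

    mid = max(hits, key=neighbours)["time"]

    # For each well, remember the hit closest to the mid-value.
    best_per_well = {}
    for hit in hits:
        current = best_per_well.get(hit["well"])
        if current is None or abs(hit["time"] - mid) < abs(current["time"] - mid):
            best_per_well[hit["well"]] = hit

    refined = {"green": [], "orange": [], "discarded": []}
    for hit in hits:
        if best_per_well[hit["well"]] is not hit:
            refined["discarded"].append(hit)  # a closer peak exists in this well
        elif abs(hit["time"] - mid) <= window:
            refined["green"].append(hit)
        else:
            refined["orange"].append(hit)
    return mid, refined


# Hypothetical hits: well 3 has two candidate peaks, well 4 is an outlier.
hits = [
    {"well": 1, "time": 1.00},
    {"well": 2, "time": 1.02},
    {"well": 3, "time": 1.01},
    {"well": 3, "time": 1.40},
    {"well": 4, "time": 1.30},
]
mid, refined = refine_by_time_sketch(hits)
```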

PyParse.refineClusterByUV(cluster, UVdatafound, comments)

Takes an input cluster of all the hit peaks, and refines it by ensuring all peaks have a similar set of UV maxima. Those which do are left in “green”; those which don’t are moved to the “orange” category.

Parameters:
  • cluster – a dict, with list of dicts for each header

  • UVdatafound – boolean for whether the rpt data contains UV data

  • comments – A list of comments for that structure so far

Returns:

List comprising [a dictionary for the refined cluster, list of comments]

PyParse.removeDupAssigns(compoundDF, internalSTD, SMs, products, by_products)

Checks each compound to ensure that no peak has been assigned to two different compounds. If this has happened, the internalSTD (if present) takes first priority, limiting reactant is second priority, product third priority and finally a by-product is lowest priority.

Parameters:
  • compoundDF – Pandas dataframe

  • internalSTD – A string representing the name of the internal standard

  • SMs – A list of starting material names

  • products – A list of product names

  • by_products – A list of by_product names

Returns:

compoundDF as Pandas dataframe
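
The priority ordering can be illustrated with plain dictionaries instead of a dataframe. In this sketch, peaks are hypothetical (well, retention time) tuples, all starting materials share the second priority level, and the helper name resolve_duplicates_sketch is an assumption; it is not how PyParse stores its assignments.

```python
def resolve_duplicates_sketch(assignments, internal_std, sms, products, by_products):
    """Keep each peak only for the highest-priority compound that claims it."""

    def priority(name):
        # Lower number means higher priority.
        if name == internal_std:
            return 0
        if name in sms:
            return 1
        if name in products:
            return 2
        if name in by_products:
            return 3
        return 4

    winners = {}
    for name, peaks in assignments.items():
        for peak in peaks:
            if peak not in winners or priority(name) < priority(winners[peak]):
                winners[peak] = name

    return {
        name: [peak for peak in peaks if winners[peak] == name]
        for name, peaks in assignments.items()
    }


# Hypothetical assignments: each shared peak is claimed by two compounds.
assignments = {
    "STD": [(1, 0.50)],
    "SM1": [(1, 0.50), (2, 0.80)],
    "P1": [(2, 0.80), (3, 1.10)],
    "BP1": [(3, 1.10)],
}
resolved = resolve_duplicates_sketch(assignments, "STD", ["SM1"], ["P1"], ["BP1"])
```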

PyParse.selectClusterByMassConf(clusters)

If more than one cluster was found for the compound, this function is called to try to select a single cluster based on which cluster has the highest mean massConf. If more than one cluster has a close-to-highest mean massConf, take them all.

Parameters:

clusters – a list of dictionaries, with list of dictionaries for each header

Return refined_clusters:

a list of dictionaries, with list of dictionaries for each header

Return discarded_clusters:

a list of dictionaries, with list of dictionaries for each header
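
A minimal sketch of this selection, assuming each cluster keeps its hits under a "green" header with a "mass_conf" value per hit, and that "close to highest" means at least a fixed fraction of the best mean. The closeness fraction, the key names and the helper name are assumptions for illustration.

```python
from statistics import mean


def select_by_mass_conf_sketch(clusters, closeness=0.9):
    """Split clusters into those near the best mean mass_conf and the rest."""
    means = [mean(hit["mass_conf"] for hit in c["green"]) for c in clusters]
    threshold = max(means) * closeness
    refined = [c for c, m in zip(clusters, means) if m >= threshold]
    discarded = [c for c, m in zip(clusters, means) if m < threshold]
    return refined, discarded


# Hypothetical clusters: "a" and "b" have similar means, "c" is much lower.
clusters = [
    {"name": "a", "green": [{"mass_conf": 0.9}, {"mass_conf": 0.9}]},
    {"name": "b", "green": [{"mass_conf": 0.8}, {"mass_conf": 0.9}]},
    {"name": "c", "green": [{"mass_conf": 0.4}, {"mass_conf": 0.4}]},
]
refined_clusters, discarded_clusters = select_by_mass_conf_sketch(clusters)
```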

PyParse.selectClusterBySize(clusters)

If more than one cluster was found for the compound, this function is called to try to select a single cluster based on which cluster is the largest. If more than one cluster has a close-to-largest size, take them all.

Parameters:

clusters – a list of dictionaries, with list of dictionaries for each header

Return refined_clusters:

a list of dictionaries, with list of dictionaries for each header

Return discarded_clusters:

a list of dictionaries, with list of dictionaries for each header
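
Selection by size can be sketched in the same spirit, counting the hits in each cluster and keeping every cluster whose size is close to the largest. The "green" header, the closeness fraction and the helper name are assumptions, not PyParse's actual thresholds.

```python
def select_by_size_sketch(clusters, closeness=0.8):
    """Split clusters into those near the largest size and the rest."""
    sizes = [len(c["green"]) for c in clusters]
    threshold = max(sizes) * closeness
    refined = [c for c, size in zip(clusters, sizes) if size >= threshold]
    discarded = [c for c, size in zip(clusters, sizes) if size < threshold]
    return refined, discarded


# Hypothetical clusters with 10, 9 and 3 hits respectively.
clusters = [
    {"name": "a", "green": [{"well": w} for w in range(1, 11)]},
    {"name": "b", "green": [{"well": w} for w in range(1, 10)]},
    {"name": "c", "green": [{"well": w} for w in range(1, 4)]},
]
refined_clusters, discarded_clusters = select_by_size_sketch(clusters)
```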

PyParse.validateHits(cpname, peakList, expected_rt)

Looks at all hits for a given compound, and refines the list based on retention time, massConf and UV data. The goal is to end up with only high confidence hits, and no more than one hit per well.

Parameters:
  • cpname – string of the compound name

  • peakList – A list of all hits for (peaks assigned to) that compound

  • expected_rt – An integer for the expected retention time of the compound

Returns:

A dictionary, where each header contains a list of dictionaries (peaks), along with other useful variables to aid plot generation.