PyParse Function Detail
Copyright 2023 GlaxoSmithKline Research & Development Limited
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Authors: Joe Mason, Francesco Rianjongdee, Harry Wilders, David Fallon
- PyParse.buildHTML(save_dir, compoundDF, all_compounds, impurities, analysis_name, times={})
Build an HTML output file using jinja2 and an html_template stored in the “templates” directory.
- Parameters:
save_dir – A string designating the output directory
compoundDF – Pandas datatable containing all information on the compounds used for analysis
all_compounds – a list of all compound names
impurities – a list of all impurity names
analysis_name – User provided name for the analysis
times – Optional list of floats giving the processing time for each step of the analysis.
- Returns:
HTML file saved to save_dir
- PyParse.findHits(compound, dataTable)
For a given compound, look at each peak in each well to find a suitable match based on m/z data.
- Parameters:
compound – A Series corresponding to a specific compound from the compoundDF pandas dataframe
dataTable – A dictionary of all peaks in all wells for the plate
- Returns:
a list of hits, where each item in the list is a dictionary
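The m/z matching idea can be illustrated with a minimal sketch. The peak layout (keys `time`, `ms_plus`, `ms_minus`), the tolerance `tol` and the name `find_hits_sketch` are assumptions made for illustration and are not taken from the PyParse source. The sketch follows the convention described for plotChroma, where +1/-1 is added to the isotopic mass to obtain the expected observed mass.

```python
def find_hits_sketch(mass, data_table, tol=0.5):
    """Return peaks whose MS+ or MS- data contains the expected ion.

    data_table: {well: [peak, ...]}, where each peak is a dict with the
    hypothetical keys 'time', 'ms_plus' and 'ms_minus' (lists of m/z values).
    """
    hits = []
    for well, peaks in data_table.items():
        for peak in peaks:
            # Expected observed masses: mass + 1 in MS+, mass - 1 in MS-.
            found_plus = any(abs(mz - (mass + 1)) <= tol for mz in peak["ms_plus"])
            found_minus = any(abs(mz - (mass - 1)) <= tol for mz in peak["ms_minus"])
            if found_plus or found_minus:
                hits.append({"well": well, "time": peak["time"],
                             "plus": found_plus, "minus": found_minus})
    return hits


# Illustrative data only: two wells, three peaks.
example_table = {
    1: [{"time": 0.85, "ms_plus": [181.1, 203.0], "ms_minus": [179.0]},
        {"time": 1.20, "ms_plus": [250.2], "ms_minus": []}],
    2: [{"time": 0.86, "ms_plus": [181.0], "ms_minus": []}],
}
hits = find_hits_sketch(180.06, example_table)
```

Each item in `hits` is a dictionary, mirroring the return type described above. A real implementation would likely also record a confidence measure for each match.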
- PyParse.findImpurities(dataTable, compoundDF, save_dir, chroma)
This function finds impurities that the program was not explicitly asked to look for. It searches for commonly occurring peaks that have a clear ionisation pattern and have not already been assigned.
- Parameters:
dataTable – dictionary of all peaks, indexed by well
compoundDF – pandas dataframe for all compounds
save_dir – string for output directory
chroma – a dictionary containing chromatograms, indexed by well
- Returns:
Chromatograms for each impurity, additional rows in the compoundDF, and hit validation graph plotted containing all impurity hits.
- PyParse.findOverlap(dataTable, well, time)
Given the dataTable, a well and a retention time of interest, this function looks for any overlapping peaks and returns the retention times of those peaks.
- Parameters:
dataTable – A dictionary of lists of dicts, indexed by well
well – An integer to describe the well number
time – The retention time, as a float, of the peak of interest
- Returns:
A list of overlaps
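One plausible way to detect overlaps is an interval test on peak start and end times. The sketch below assumes each peak is a dict with the hypothetical keys `time`, `pStart` and `pEnd`. The actual PyParse peak structure and overlap criterion may differ.

```python
def find_overlap_sketch(data_table, well, time):
    """Return retention times of peaks in `well` that overlap the peak at `time`.

    Each peak is assumed to be a dict with hypothetical keys
    'time', 'pStart' and 'pEnd' (floats).
    """
    peaks = data_table[well]
    # Locate the peak of interest by its retention time.
    target = next((p for p in peaks if abs(p["time"] - time) < 1e-6), None)
    if target is None:
        return []
    overlaps = []
    for p in peaks:
        if p is target:
            continue
        # Two intervals overlap when each one starts before the other ends.
        if p["pStart"] < target["pEnd"] and target["pStart"] < p["pEnd"]:
            overlaps.append(p["time"])
    return overlaps


# Illustrative data only: the first two peaks in well 5 share the 0.54-0.56 region.
table = {5: [
    {"time": 0.50, "pStart": 0.45, "pEnd": 0.56},
    {"time": 0.58, "pStart": 0.54, "pEnd": 0.63},
    {"time": 0.90, "pStart": 0.85, "pEnd": 0.95},
]}
overlaps = find_overlap_sketch(table, 5, 0.50)
```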
- PyParse.findPotentialConflicts(compoundDF)
Find any products which are in danger of overlapping with other compounds that are expected in the same well. This function is useful because the program may not detect the SM (etc.) for that peak/well when the mass_conf is too low, but the user should still be warned that there may be a problem.
- Parameters:
compoundDF – Pandas datatable for compounds
- Returns:
A text output for that compound.
- PyParse.genLocationHeatmaps(cptable, save_dir)
Generate a heatmap type graph to visualise the expected locations of each compound, based on the platemap that was provided.
- Parameters:
cptable – Pandas dataframe containing compounds
save_dir – string describing the location of the output folder
- Output:
Heatmaps saved to the output folder.
- PyParse.generateMol(smiles, name, save_dir)
Generates a 2D rendering of the given structure and saves it as a .png file.
- Parameters:
smiles – a string (SMILES) of the compound
name – a string corresponding to name of that compound
save_dir – a string corresponding to the output directory
- Returns:
A .png rendering of the given compound
- PyParse.generateOutputTable(compoundDF, internalSTD, SMs, products, by_products, total_area_abs)
Reformats the validated hits into a pandas table ready for visualisation and export.
- Parameters:
compoundDF – The pandas datatable containing all compounds with their respective hits.
internalSTD – the name of the internal standard
SMs – a list of indices for the starting materials
products – a list of indices for the products
by_products – a list of indices for the by-products
total_area_abs – A float corresponding to the sum of all peak_area_absolutes
- Returns:
A Pandas table named outputTable
- PyParse.getUserReadableWell(wellno)
Converts a well number into a user-friendly string, e.g. well 11 becomes “B5” for a 4 × 6 well plate.
- Parameters:
wellno – An integer representing a specific well on the plate
- Returns:
A string representing a specific well on the plate
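The conversion can be sketched as follows, assuming wells are numbered from 1 in row-major order (across each row, then down to the next). This assumption reproduces the example above. The function name `well_to_label` and the explicit `n_cols` argument are hypothetical; the real function takes only `wellno`.

```python
def well_to_label(wellno, n_cols):
    """Convert a 1-based, row-major well number into a label such as 'B5'."""
    if wellno < 1:
        raise ValueError("well numbers start at 1")
    # Zero-based row and column indices within the plate.
    row, col = divmod(wellno - 1, n_cols)
    return f"{chr(ord('A') + row)}{col + 1}"


# A 4 x 6 plate has 4 rows (A-D) and 6 columns (1-6).
label = well_to_label(11, 6)
```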
- PyParse.importStructures(filename, save_dir)
From a given CSV, with defined header names and one row per well, deconvolute the information to give a dataframe of one compound per row, indexed by canonical SMILES. The well must be specified using a capital letter from A-Z to describe the row, and a positive integer to describe the column. Column numbers should be written as 1, 2, etc. as opposed to 01, 02, etc.
- Parameters:
filename – file name and directory as a string
save_dir – a string for the output directory
- Returns:
List comprising [Pandas dataframe, name of internal STD, list of names of the starting materials, list of names of the products, list of names of the byproducts]
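The well format rules above (one capital letter for the row, then a positive integer without leading zeros for the column) can be checked with a regular expression. This is a minimal sketch; `parse_well` is a hypothetical helper and not part of PyParse.

```python
import re

# One capital letter A-Z, then a positive integer with no leading zero.
WELL_PATTERN = re.compile(r"([A-Z])([1-9][0-9]*)")


def parse_well(label):
    """Split a well label such as 'B5' into (row_letter, column_number)."""
    match = WELL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"invalid well label: {label!r}")
    return match.group(1), int(match.group(2))


row, column = parse_well("B5")
```

Labels such as “A01” or “b5” raise a ValueError under this pattern, which matches the formatting rules stated for the input CSV.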
- PyParse.main()
Given a .rpt file and a CSV file listing the compounds in the plate, analyses the wells and returns an output containing plots of compounds, key wells, and multiple output tables.
- PyParse.plotChroma(cpname, wellno, trace, pStart, pEnd, annotate_peaks, save_dir, ms_plus, ms_minus, mass1)
Plots the LCMS trace with labels for compounds found for a specific well, and highlights a particular peak of interest, providing m/z data for that peak.
- Parameters:
cpname – a string of the compound name
wellno – an integer representing the well
trace – a list [x-values, y-values] to plot the UV chromatogram
pStart – a float for the time where a specific peak begins
pEnd – a float for the time where a specific peak ends
annotate_peaks – a list of dictionaries for peaks to annotate
save_dir – a string for the output directory
ms_plus – a list [x-values, y-values] for MS+ spectrometric data
ms_minus – a list [x-values, y-values] for MS- spectrometric data
mass1 – the isotopic mass of a compound, to which +1/-1 should be added to get to an expected observed mass (typically parent isotopic mass)
- Returns:
jpg of the chromatogram saved to output directory
- PyParse.plotDonut(dataframe, save_dir)
Takes the outputTable dataframe and generates a donut chart for product %area to be saved in the output folder.
- Parameters:
dataframe – A Pandas dataframe (AKA outputTable)
save_dir – a string for the output directory
- Returns:
Saved jpg of the donut chart
- PyParse.plotHeatmaps(outputTable, save_dir)
Plots and saves heatmaps for the full dataset.
- Parameters:
outputTable – a pandas datatable
save_dir – a string for the output directory
- Returns:
jpg of the heatmap saved to output directory
- PyParse.plotHistogram(dataframe, save_dir)
Takes the outputTable dataframe and generates a histogram for product %area to be saved in the output folder.
- Parameters:
dataframe – A Pandas dataframe (AKA outputTable)
save_dir – a string for the output directory
- Returns:
jpg of the histogram saved to output directory
- PyParse.plotHitValidationGraph(cpname, validatedHits, save_dir, cluster_bands)
Plots all the hit peaks in a scatter graph of peaktime vs well, colour coded by whether the hit was included or discarded from the final output.
- Parameters:
cpname – a string for the name of the compound of interest
validatedHits – a dict, where each header contains a list of dicts
save_dir – File directory for where to save the matplotlib figure
cluster_bands – A list of the average retention times for each cluster.
- Returns:
jpg of the hit validation graph saved to output directory
- PyParse.plotPieCharts(zvalue, outputTable, save_dir, by_products)
Plots a set of pie charts for the full plate using the full dataset. The size of each pie chart is dependent on the value in the datatable for the column specified by zvalue.
- Parameters:
zvalue – a string corresponding to the desired output metric (e.g. P/STD)
outputTable – a pandas datatable
save_dir – a string for the output directory
by_products – a list of names of byproducts
- Returns:
jpg of the piecharts saved to output directory
- PyParse.refineClusterByMassConf(cluster, comments)
Takes an input cluster of all the hit peaks and refines it by ensuring all peaks have a mass confidence similar to the cluster’s mean. Those which do are left in “green”; those which don’t are moved to the “orange” category.
- Parameters:
cluster – a dict, with list of dicts for each header
comments – A list of comments for the compound so far
- Returns:
List comprising [a dictionary for the refined cluster, list of comments]
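The refinement described above can be sketched in a few lines. The key name `mass_conf`, the `tolerance` threshold and the function name are assumptions for illustration; the actual criterion for “similar” in PyParse is not stated here.

```python
from statistics import mean


def refine_by_mass_conf_sketch(cluster, comments, tolerance=0.2):
    """Move 'green' hits whose mass confidence is far from the mean to 'orange'.

    cluster: {'green': [...], 'orange': [...]}, each hit being a dict with a
    hypothetical 'mass_conf' key. `tolerance` is an illustrative threshold.
    """
    green = cluster["green"]
    if not green:
        return [cluster, comments]
    avg = mean(hit["mass_conf"] for hit in green)
    keep = [h for h in green if abs(h["mass_conf"] - avg) <= tolerance]
    moved = [h for h in green if abs(h["mass_conf"] - avg) > tolerance]
    # Build new containers rather than mutating the inputs.
    refined = {**cluster, "green": keep, "orange": cluster["orange"] + moved}
    if moved:
        comments = comments + [f"{len(moved)} hit(s) moved to orange on mass confidence."]
    return [refined, comments]


# Illustrative data only: the hit in well 5 has a much lower mass confidence.
example = {
    "green": [{"well": w, "mass_conf": c}
              for w, c in [(1, 0.90), (2, 0.88), (3, 0.92), (4, 0.86), (5, 0.30)]],
    "orange": [],
}
refined, comments = refine_by_mass_conf_sketch(example, [])
```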
- PyParse.refineClusterByTime(cluster, comments, expected_rt)
Takes an input cluster of all the hit peaks and refines it by finding a mid-value for the retention time, based on which hit has the greatest number of nearest neighbours. Sorts the best hits into “green”, uncertain ones into “orange”, and those where another peak closer to the mid-value was found in the same well into “discarded”.
- Parameters:
cluster – list of dictionaries, where each dictionary is a hit
comments – A list of comments for that structure so far.
expected_rt – the expected retention time of the compound
- Returns:
List comprising [a dictionary for the refined cluster, list of comments]
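A minimal sketch of the neighbour-counting idea follows. The hit keys (`well`, `time`), both window sizes and the decision to discard hits far from the mid-value are assumptions for illustration. The sketch also ignores expected_rt, which the real function accepts.

```python
def refine_by_time_sketch(cluster, comments, window=0.02, wide_window=0.05):
    """Sort hits into green/orange/discarded around a consensus retention time.

    cluster: list of hits, each a dict with hypothetical keys 'well' and 'time'.
    `window` and `wide_window` are illustrative thresholds.
    """
    refined = {"green": [], "orange": [], "discarded": []}
    if not cluster:
        return [refined, comments]

    # The hit with the most neighbours inside `window` defines the mid-value.
    def neighbours(hit):
        return sum(abs(other["time"] - hit["time"]) <= window for other in cluster)

    mid = max(cluster, key=neighbours)["time"]

    # Within each well, only the hit closest to the mid-value survives.
    best_per_well = {}
    for hit in cluster:
        current = best_per_well.get(hit["well"])
        if current is None or abs(hit["time"] - mid) < abs(current["time"] - mid):
            best_per_well[hit["well"]] = hit

    for hit in cluster:
        distance = abs(hit["time"] - mid)
        if best_per_well[hit["well"]] is not hit:
            refined["discarded"].append(hit)
        elif distance <= window:
            refined["green"].append(hit)
        elif distance <= wide_window:
            refined["orange"].append(hit)
        else:
            refined["discarded"].append(hit)  # far from the mid-value (assumption)
    return [refined, comments]


# Illustrative data only: well 3 contains two candidate peaks.
hits = [
    {"well": 1, "time": 1.00},
    {"well": 2, "time": 1.01},
    {"well": 3, "time": 0.99},
    {"well": 3, "time": 1.30},
    {"well": 4, "time": 1.04},
]
refined, comments = refine_by_time_sketch(hits, [])
```

In this example the mid-value is 1.00, so the hits in wells 1 to 3 near 1.00 land in “green”, the hit at 1.04 lands in “orange”, and the second peak in well 3 is discarded.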
- PyParse.refineClusterByUV(cluster, UVdatafound, comments)
Takes an input cluster of all the hit peaks and refines it by ensuring all peaks have a similar set of UV maxima. Those which do are left in “green”; those which don’t are moved to the “orange” category.
- Parameters:
cluster – a dict, with list of dicts for each header
UVdatafound – boolean for whether the rpt data contains UV data
comments – A list of comments for that structure so far
- Returns:
List comprising [a dictionary for the refined cluster, list of comments]
- PyParse.removeDupAssigns(compoundDF, internalSTD, SMs, products, by_products)
Checks each compound to ensure that no peak has been assigned to two different compounds. If this has happened, the internalSTD (if present) takes first priority, the limiting reactant second, the product third, and a by-product the lowest priority.
- Parameters:
compoundDF – Pandas dataframe
internalSTD – A string representing the name of the internal standard
SMs – A list of starting material names
products – A list of product names
by_products – A list of by_product names
- Returns:
compoundDF as Pandas dataframe
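The priority order can be expressed as a ranking function. The sketch below works on a plain dictionary instead of compoundDF, and treats every starting material as second priority. Both simplifications, along with the names used, are assumptions for illustration.

```python
def resolve_duplicates_sketch(assignments, internal_std, sms, products, by_products):
    """Keep one compound per peak using the priority order described above.

    assignments: {(well, retention_time): [compound_name, ...]}
    (a hypothetical layout, not the compoundDF structure).
    """
    def priority(name):
        if name == internal_std:
            return 0
        if name in sms:
            return 1
        if name in products:
            return 2
        if name in by_products:
            return 3
        return 4  # unknown category: lowest priority (an assumption)

    # min() picks the claimant with the best (lowest) priority rank.
    return {peak: min(names, key=priority) for peak, names in assignments.items()}


# Illustrative data only: three peaks, each claimed by two compounds.
assignments = {
    (1, 0.85): ["Product1", "SM1"],
    (1, 1.20): ["Byprod1", "Product1"],
    (2, 0.50): ["SM1", "ISTD"],
}
resolved = resolve_duplicates_sketch(
    assignments, "ISTD", ["SM1"], ["Product1"], ["Byprod1"]
)
```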
- PyParse.selectClusterByMassConf(clusters)
If more than one cluster was found for the compound, this function is called to try to select a single cluster based on which cluster has the highest mean massConf. If more than one cluster has a close-to-highest mean massConf, take them all.
- Parameters:
clusters – a list of dictionaries, with list of dictionaries for each header
- Return refined_clusters:
a list of dictionaries, with list of dictionaries for each header
- Return discarded_clusters:
a list of dictionaries, with list of dictionaries for each header
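The selection rule can be sketched as a comparison of each cluster's mean against the best mean. The `mass_conf` key and the `margin` that defines “close-to-highest” are assumptions for illustration.

```python
from statistics import mean


def select_by_mass_conf_sketch(clusters, margin=0.1):
    """Split clusters into those near the best mean mass confidence and the rest.

    clusters: list of dicts such as {'green': [hits], 'orange': [hits]}; each
    hit has a hypothetical 'mass_conf' key. `margin` is an illustrative threshold.
    """
    def cluster_mean(cluster):
        values = [hit["mass_conf"] for hits in cluster.values() for hit in hits]
        return mean(values) if values else 0.0

    means = [cluster_mean(c) for c in clusters]
    best = max(means, default=0.0)
    refined = [c for c, m in zip(clusters, means) if best - m <= margin]
    discarded = [c for c, m in zip(clusters, means) if best - m > margin]
    return refined, discarded


# Illustrative data only: two clusters with similar means and one clearly worse.
cluster_a = {"green": [{"mass_conf": 0.9}, {"mass_conf": 0.8}], "orange": []}
cluster_b = {"green": [{"mass_conf": 0.8}], "orange": [{"mass_conf": 0.8}]}
cluster_c = {"green": [{"mass_conf": 0.3}], "orange": []}
refined_clusters, discarded_clusters = select_by_mass_conf_sketch(
    [cluster_a, cluster_b, cluster_c]
)
```

Replacing the mean with a count of the hits in each cluster gives a corresponding sketch of size-based selection.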
- PyParse.selectClusterBySize(clusters)
If more than one cluster was found for the compound, this function is called to try to select a single cluster based on which cluster is the largest. If more than one cluster has a close-to-largest size, take them all.
- Parameters:
clusters – a list of dictionaries, with list of dictionaries for each header
- Return refined_clusters:
a list of dictionaries, with list of dictionaries for each header
- Return discarded_clusters:
a list of dictionaries, with list of dictionaries for each header
- PyParse.validateHits(cpname, peakList, expected_rt)
Looks at all hits for a given compound, and refines the list based on retention time, massConf and UV data. The goal is to end up with only high confidence hits, and no more than one hit per well.
- Parameters:
cpname – string of the compound name
peakList – A list of all hits for (peaks assigned to) that compound
expected_rt – An integer for the expected retention time of the compound
- Returns:
A dictionary, where each header contains a list of dictionaries (peaks), along with other useful variables to aid plot generation.