The md_harmonize API Reference¶
This package includes the following modules.
md_harmonize.compound¶
This module provides the Atom
class, the Bond
class,
and the Compound
class to construct a compound entity. Most of the instance
variables of these three classes are based on CTFile fields.
-
class
md_harmonize.compound.
Atom
(atom_symbol: str, atom_number: int, x: str = '0', y: str = '0', z: str = '0', mass_difference: str = '0', charge: str = '0', atom_stereo_parity: str = '0', hydrogen_count: str = '0', stereo_care_box: str = '0', valence: str = '0', h0designator: str = '0', atom_atom_mapping_number: str = '0', inversion_retention_flag: str = '0', exact_change_flag: str = '0', kat: str = '', in_cycle: bool = False)[source]¶ Atom class describes the
Atom
entity in the compound. Atom initializer.
- Parameters
atom_symbol – the atom symbol.
atom_number – the atom number.
x – the atom x coordinate.
y – the atom y coordinate.
z – the atom z coordinate.
mass_difference – difference from mass in periodic table.
charge – the atom charge.
atom_stereo_parity – the atom stereo parity.
hydrogen_count – the hydrogen count.
stereo_care_box – the stereo care box.
valence – the valence.
h0designator – h0designator (obsolete CTFile parameter).
atom_atom_mapping_number – the atom-atom mapping number.
inversion_retention_flag – the inversion/retention flag.
exact_change_flag – the exact change flag.
kat – KEGG atom type.
in_cycle – whether the atom is in a cycle.
-
__init__
(atom_symbol: str, atom_number: int, x: str = '0', y: str = '0', z: str = '0', mass_difference: str = '0', charge: str = '0', atom_stereo_parity: str = '0', hydrogen_count: str = '0', stereo_care_box: str = '0', valence: str = '0', h0designator: str = '0', atom_atom_mapping_number: str = '0', inversion_retention_flag: str = '0', exact_change_flag: str = '0', kat: str = '', in_cycle: bool = False)[source]¶ Atom initializer.
- Parameters
atom_symbol – the atom symbol.
atom_number – the atom number.
x – the atom x coordinate.
y – the atom y coordinate.
z – the atom z coordinate.
mass_difference – difference from mass in periodic table.
charge – the atom charge.
atom_stereo_parity – the atom stereo parity.
hydrogen_count – the hydrogen count.
stereo_care_box – the stereo care box.
valence – the valence.
h0designator – h0designator (obsolete CTFile parameter).
atom_atom_mapping_number – the atom-atom mapping number.
inversion_retention_flag – the inversion/retention flag.
exact_change_flag – the exact change flag.
kat – KEGG atom type.
in_cycle – whether the atom is in a cycle.
-
update_symbol
(symbol: str) → str[source]¶ To update the atom symbol.
- Parameters
symbol – the updated atom symbol.
- Returns
the updated atom_symbol.
-
update_atom_number
(index: int) → int[source]¶ To update the atom number.
- Parameters
index – the updated atom number.
- Returns
the updated atom number.
-
remove_neighbors
(neighbors: list) → list[source]¶ To remove neighbors from the atom.
- Parameters
neighbors – the list of neighbors that will be removed from this atom.
- Returns
the updated list of neighbors.
-
add_neighbors
(neighbors: list) → list[source]¶ To add neighbors to the atom.
- Parameters
neighbors – the list of neighbors that will be added to this atom.
- Returns
the updated list of neighbors.
-
update_stereochemistry
(stereo: str) → str[source]¶ To update the atom stereochemistry.
- Parameters
stereo – the updated atom stereochemistry.
- Returns
the updated atom stereochemistry.
-
color_atom
(isotope_resolved: bool = False, charge: bool = False, atom_stereo: bool = False) → str[source]¶ To generate the atom color of the zero layer.
- Parameters
isotope_resolved – If true, add isotope information when constructing colors.
charge – If true, add charge information when constructing colors.
atom_stereo – If true, add atom stereochemistry information when constructing colors.
- Returns
the atom color of the zero layer.
-
update_kat
(kat: str) → str[source]¶ To update the atom KEGG atom type.
- Parameters
kat – the KEGG atom type for this atom.
- Returns
the updated KEGG atom type.
-
class
md_harmonize.compound.
Bond
(first_atom_number: str, second_atom_number: str, bond_type: str, bond_stereo: str = '0', bond_topology: str = '0', reacting_center_status: str = '0')[source]¶ Bond class describes the
Bond
entity in the compound. Bond initializer.
- Parameters
first_atom_number – the index of the first atom forming this bond.
second_atom_number – the index of the second atom forming this bond.
bond_type – the bond type. (1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any)
bond_stereo – the bond stereo. (Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down; Double bonds: determined by x, y, z coordinates)
bond_topology – bond topology. (0 = Either, 1 = Ring, 2 = Chain)
reacting_center_status – reacting center status.
-
__init__
(first_atom_number: str, second_atom_number: str, bond_type: str, bond_stereo: str = '0', bond_topology: str = '0', reacting_center_status: str = '0')[source]¶ Bond initializer.
- Parameters
first_atom_number – the index of the first atom forming this bond.
second_atom_number – the index of the second atom forming this bond.
bond_type – the bond type. (1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any)
bond_stereo – the bond stereo. (Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down; Double bonds: determined by x, y, z coordinates)
bond_topology – bond topology. (0 = Either, 1 = Ring, 2 = Chain)
reacting_center_status – reacting center status.
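As an illustration of the bond type codes documented for this parameter, the following sketch maps each code to its meaning. The `BOND_TYPE_NAMES` dictionary and the `describe_bond_type` helper are hypothetical and are not part of md_harmonize; only the codes themselves come from the parameter description.

```python
# Hypothetical lookup table built from the documented bond_type codes.
# Codes are strings because the Bond signature annotates bond_type as str.
BOND_TYPE_NAMES = {
    "1": "Single",
    "2": "Double",
    "3": "Triple",
    "4": "Aromatic",
    "5": "Single or Double",
    "6": "Single or Aromatic",
    "7": "Double or Aromatic",
    "8": "Any",
}


def describe_bond_type(bond_type: str) -> str:
    """Return a readable name for a bond type code, or 'Unknown'."""
    return BOND_TYPE_NAMES.get(bond_type, "Unknown")
```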
-
update_bond_type
(bond_type: str) → str[source]¶ To update the bond type.
- Parameters
bond_type – the updated bond type.
- Returns
the updated bond type.
-
update_stereochemistry
(stereo: str) → str[source]¶ To update the bond stereochemistry.
- Parameters
stereo – the updated bond stereochemistry.
- Returns
the updated bond stereochemistry.
-
update_first_atom
(index: int) → int[source]¶ To update the first atom number of the bond.
- Parameters
index – the updated first atom number.
- Returns
the updated first atom number.
-
class
md_harmonize.compound.
Compound
(compound_name: str, atoms: list, bonds: list)[source]¶ Compound class describes the
Compound
entity. Compound initializer.
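- Parameters
compound_name – the compound name.
atoms – the list of Atom entities in the compound.
bonds – the list of Bond entities in the compound.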
-
property
name
¶ To get the compound name.
- Returns
the compound name.
-
static
molfile_name
(molfile: str)[source]¶ Create the compound entity based on the molfile representation.
- Parameters
molfile – the filename of the molfile.
- Returns
the constructed compound entity.
-
property
formula
¶ To construct the formula of this compound (only consider heavy atoms).
- Returns
string formula of the compound.
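A minimal sketch of how a heavy-atom formula could be assembled from a list of atom symbols. The helper name `heavy_atom_formula`, the alphabetical element order, and the choice to always print the count are assumptions of this sketch; the actual property may format the string differently.

```python
from collections import Counter


def heavy_atom_formula(atom_symbols: list) -> str:
    """Build a formula string from atom symbols, ignoring hydrogens."""
    counts = Counter(symbol for symbol in atom_symbols if symbol != "H")
    # Assumption: elements sorted alphabetically, each followed by its count.
    return "".join(f"{symbol}{counts[symbol]}" for symbol in sorted(counts))
```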
-
property
composition
¶ To get the atom symbols and bond types in the compound.
- Returns
the atom and bond information of the compound.
-
property
r_groups
¶ To get all the R groups in the compound.
- Returns
the list of indices of all the R groups.
-
contains_r_groups
() → bool[source]¶ To check if the compound contains R group(s).
- Returns
bool whether the compound contains R group.
-
has_isolated_atoms
() → bool[source]¶ To check if the compound has atoms that have no connections to other atoms.
- Returns
bool whether the compound has isolated atoms.
-
property
metal_index
¶ To get the metal elements in the compound.
- Returns
a list of atom numbers of metal elements.
-
property
h_index
¶ To get all H in the compound.
- Returns
a list of atom numbers corresponding to H.
-
property
heavy_atoms
¶ To get all the heavy atoms in the compound.
- Returns
a list of atom numbers corresponding to heavy atoms.
-
property
index_of_heavy_atoms
¶ To map the atom number to index in the heavy atom list.
- Returns
the dictionary of atom number to atom index of heavy atoms.
-
color_groups
(excluded=None) → dict[source]¶ To update the compound color groups after coloring.
- Returns
the dictionary of atom color with the list of atom number.
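The returned structure can be pictured as a grouping of atom numbers by color. The sketch below is hypothetical (the helper name and the plain `atom_colors` dictionary input are assumptions, not the library's data model) and only illustrates the shape of the result.

```python
from collections import defaultdict


def group_atoms_by_color(atom_colors: dict, excluded=None) -> dict:
    """Map each color to the list of atom numbers carrying that color.

    atom_colors: atom number -> color string (hypothetical input format).
    excluded: atom numbers to leave out of the grouping.
    """
    excluded = set(excluded or [])
    groups = defaultdict(list)
    for atom_number, color in atom_colors.items():
        if atom_number not in excluded:
            groups[color].append(atom_number)
    return dict(groups)
```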
-
detect_abnormal_atom
() → dict[source]¶ To find the atoms with invalid bond counts.
- Returns
a list of atom numbers with invalid bond counts.
-
update_aromatic_bond_type
(cycles: list) → None[source]¶ Update the aromatic bond types. Two cases: 1) change the bond in the aromatic ring to aromatic bond (bond type = 4); 2) change the double bond connecting to the aromatic ring to single bond.
- Parameters
cycles – the list of cycles represented by aromatic atom index.
- Returns
None.
-
extract_double_bond_connecting_cycle
(atom_in_cycle: list) → list[source]¶ To extract the double bonds connecting to the atom in the aromatic cycles.
- Parameters
atom_in_cycle – the list of aromatic cycles represented by aromatic atom index.
- Returns
the list of outside double bond connecting to the atom in the aromatic cycles.
-
extract_aromatic_bonds
(cycle: list) → list[source]¶ Extract the aromatic bonds based on the atoms in the cycle.
- Parameters
cycle – the list of aromatic cycles represented by aromatic atom index.
- Returns
the list of aromatic bonds.
-
separate_connected_components
(bonds: Union[list, set]) → list[source]¶ This is used in constructing the aromatic substructures detected by the Indigo method. A compound can have several disjoint aromatic substructures. Here, we need to find the disjoint parts. The basic idea is union-find. We union atoms that are connected by a bond.
- Parameters
bonds – the list of bonds represented by the atom numbers forming the bond.
- Returns
a list of separate components, each represented by a list of atom numbers in the component.
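The union-find idea described for this method can be sketched as follows. This is an illustrative, self-contained version under the assumption that each bond is given as a pair of atom numbers; the function name `separate_components` is hypothetical and the library's implementation may differ in detail.

```python
def separate_components(bonds) -> list:
    """Group atom numbers into disjoint components using union-find.

    bonds: iterable of (first_atom_number, second_atom_number) pairs.
    """
    parent = {}

    def find(atom):
        # Register unseen atoms as their own root, then walk up to the root.
        parent.setdefault(atom, atom)
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]  # path halving
            atom = parent[atom]
        return atom

    # Union the two atoms of every bond.
    for first, second in bonds:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_first] = root_second

    # Collect atoms under their final root.
    components = {}
    for atom in parent:
        components.setdefault(find(atom), []).append(atom)
    return [sorted(members) for members in components.values()]
```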
-
connected_components
() → dict[source]¶ Detect the connected components in the compound structure (using breadth-first search). This handles cases where not all the atoms are connected together.
- Returns
the dictionary of the connected components.
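A breadth-first search over an adjacency mapping is enough to label connected components. The sketch below assumes a hypothetical `neighbors` dictionary (atom number to list of neighboring atom numbers) and numbers the components from 0; the key scheme of the real return value is not documented here.

```python
from collections import deque


def find_connected_components(neighbors: dict) -> dict:
    """Return component index -> list of atom numbers, found by BFS."""
    components = {}
    visited = set()
    for start in neighbors:
        if start in visited:
            continue
        queue = deque([start])
        visited.add(start)
        members = []
        while queue:
            atom = queue.popleft()
            members.append(atom)
            for neighbor in neighbors[atom]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        # Assumption: components are keyed by discovery order.
        components[len(components)] = members
    return components
```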
-
calculate_distance_to_r_groups
() → None[source]¶ To calculate the distance of each atom to its nearest R group (using Dijkstra’s algorithm).
- Returns
None.
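The nearest-R-group distance can be computed with a multi-source variant of Dijkstra's algorithm. The following sketch assumes every bond has weight 1 and uses a hypothetical `neighbors` adjacency dictionary; unlike the method above, it returns the distances instead of storing them on the atoms.

```python
import heapq


def distance_to_nearest_r(neighbors: dict, r_groups: list) -> dict:
    """Multi-source Dijkstra with unit bond weights (an assumption of this sketch)."""
    distances = {atom: float("inf") for atom in neighbors}
    heap = []
    for r_group in r_groups:
        distances[r_group] = 0
        heapq.heappush(heap, (0, r_group))
    while heap:
        dist, atom = heapq.heappop(heap)
        if dist > distances[atom]:
            continue  # stale heap entry
        for neighbor in neighbors[atom]:
            if dist + 1 < distances[neighbor]:
                distances[neighbor] = dist + 1
                heapq.heappush(heap, (dist + 1, neighbor))
    return distances
```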
-
find_cycles
(short_circuit: bool = False, cutoff: int = 40, seconds=50) → list[source]¶ To find the cycles in the compound.
- Parameters
short_circuit – whether to take short path.
cutoff – limit of cycle length.
seconds – the timeout limit.
- Returns
the list of cycles in the compound.
-
find_cycles_helper
(short_circuit: bool = False, cutoff: int = 40) → list[source]¶ Executing function to find the cycles in the compound.
- Parameters
short_circuit – whether to take short path.
cutoff – limit of cycle length.
- Returns
the list of cycles in the compound.
-
structure_matrix
(resonance: bool = False, backbone: bool = False) → numpy.ndarray[source]¶ To construct the graph structural matrix of this compound. matrix[i][j] = 0 means the two atoms are not directly connected. Any other integer represents the type of the bond connecting the two atoms.
- Parameters
resonance – bool whether to ignore the difference between single and double bonds.
backbone – bool whether to ignore bond types. This is for parsing atoms mappings from KEGG RCLASS.
- Returns
the constructed structure matrix for this compound.
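A small sketch of such a structure matrix, written with nested lists to stay dependency-free (the method itself returns a `numpy.ndarray`). The input format, the function name, and the way `resonance` and `backbone` collapse bond types are assumptions made for illustration only.

```python
def build_structure_matrix(atom_count, bonds, resonance=False, backbone=False):
    """Build a symmetric matrix of bond types.

    bonds: iterable of (i, j, bond_type) with 0-based indices and integer bond types.
    """
    matrix = [[0] * atom_count for _ in range(atom_count)]
    for i, j, bond_type in bonds:
        if backbone:
            bond_type = 1  # assumption: every bond collapses to a single value
        elif resonance and bond_type == 2:
            bond_type = 1  # assumption: double bonds are treated as single bonds
        matrix[i][j] = matrix[j][i] = bond_type
    return matrix
```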
-
property
distance_matrix
¶ To construct the distance matrix of the compound (using the Floyd-Warshall algorithm). distance[i][j] is the distance between atoms i and j.
- Returns
the distance matrix of the compound.
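For reference, a compact Floyd-Warshall sketch over a structure matrix given as nested lists. It assumes each bond counts as distance 1 and that unreachable pairs are reported as infinity; the function name is hypothetical and the property above may represent these cases differently.

```python
def all_pairs_distances(structure_matrix):
    """Floyd-Warshall on a matrix whose non-zero entries mark directly bonded atoms."""
    n = len(structure_matrix)
    inf = float("inf")
    distance = [
        [0 if i == j else (1 if structure_matrix[i][j] else inf) for j in range(n)]
        for i in range(n)
    ]
    # Allow each atom k in turn to act as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if distance[i][k] + distance[k][j] < distance[i][j]:
                    distance[i][j] = distance[i][k] + distance[k][j]
    return distance
```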
-
update_color_tuple
(resonance: bool = False) → None[source]¶ To update the color tuple of the atoms in the compound. This color tuple includes information of its neighboring atoms and bonds. Here, we don’t need to consider backbone since this part was initially designed for aromatic substructure detection and only double and single bonds are considered.
- Parameters
resonance – bool whether to ignore the difference between single and double bonds.
- Returns
None.
-
find_mappings
(the_other, resonance: bool = True, r_distance: bool = False, backbone: bool = False) → list[source]¶ Find the one to one atom mappings between two compounds using the BASS algorithm. The other compound is supposed to be contained in the self compound.
- Parameters
the_other – the mappings compound entity.
resonance – whether to ignore the difference between single and double bonds.
r_distance – whether to take account of the position of R groups.
backbone – whether to ignore the bond types.
- Returns
the list of atom mappings in the heavy atom order.
-
find_mappings_reversed
(the_other, resonance: bool = True, r_distance: bool = False, backbone: bool = False) → list[source]¶ Find the one to one atom mappings between two compounds using the BASS algorithm. The self compound is supposed to be contained in the other compound.
- Parameters
the_other – the mappings compound entity.
resonance – whether to ignore the difference between single and double bonds.
r_distance – whether to take account of the position of R groups.
backbone – whether to ignore the bond types.
- Returns
the list of atom mappings in the heavy atom order.
-
map_resonance
(the_other, r_distance: bool = False, seconds: int = 50) → list[source]¶ Check if the resonant mappings are valid between two compound structures.
- Parameters
the_other – the mappings compound entity.
r_distance – to take account of the position of R groups.
seconds – the timeout limit.
- Returns
the list of valid atom mappings between the two compound structures.
-
map_resonance_helper
(the_other, r_distance: bool = False) → list[source]¶ Check if the resonant mappings are valid between the two compound structures. If the mapped atoms don’t share the same local coloring identifier, we check if the difference is caused by the position of double bonds. Find the three atoms involved in the resonant structure and check if one of the atoms is not C.
    N (a)              N (a)
    /                  //
(b) C  N (c)       (b) C  N (c)
In addition, the self compound is supposed to be more generic, which means it has fewer atoms. Therefore, atoms in the self compound can all be mapped to the other compound.
- Parameters
the_other – the mappings compound entity.
r_distance – to take account of position of R groups.
- Returns
the list of valid atom mappings between the two compound structures.
-
find_double_bond_linked_atom
(i: int) → int[source]¶ Find the atom that is doubly linked to the target atom i.
- Parameters
i – the ith atom in the compound.
- Returns
the index of the doubly linked atom.
-
define_bond_stereochemistry
() → None[source]¶ Define the stereochemistry of double bonds in the compound.
- Returns
None.
-
calculate_bond_stereochemistry
(bond: md_harmonize.compound.Bond) → int[source]¶ Calculate the stereochemistry of the double bond based on its geometric properties. The line of the double bond divides the plane into two parts. Each atom forming the double bond normally has two branches. If the two branches are not the same, we call them the heavy side and the light side (the heavy side contains atoms with heavier atomic weights). We determine the bond stereochemistry by checking if the two heavy sides lie on the same part of the divided plane.
- Parameters
bond – the bond entity.
- Returns
the calculated bond stereochemistry.
-
static
calculate_y_coordinate
(slope: float, b: float, atom: md_harmonize.compound.Atom) → float[source]¶ Calculate the y coordinate of the atom based on the linear function: y = slope * x + b
- Parameters
slope – the slope of the targeted line.
b – the intercept of the targeted line.
atom – the atom entity.
- Returns
the calculated y coordinate.
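A line function of this form can be used to decide on which side of the double-bond line a neighboring atom lies. The sketch below uses plain (x, y) tuples instead of Atom entities, and both helper names are hypothetical; it also assumes the line is not vertical, a case a full implementation would need to handle separately.

```python
def y_on_line(slope: float, b: float, x: float) -> float:
    """Evaluate y = slope * x + b."""
    return slope * x + b


def same_side(slope: float, b: float, point_one: tuple, point_two: tuple) -> bool:
    """True if both (x, y) points lie strictly on the same side of the line."""
    offset_one = point_one[1] - y_on_line(slope, b, point_one[0])
    offset_two = point_two[1] - y_on_line(slope, b, point_two[0])
    # Same sign of the vertical offsets means same side of the line.
    return offset_one * offset_two > 0
```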
-
collect_atomic_weights_of_neighbors
(neighbors: list) → list[source]¶ To collect the atomic weights of the current layer’s neighbors.
- Parameters
neighbors – the list of atom numbers of neighbors.
- Returns
the list of atomic weights for this layer’s neighbors.
-
compare_branch_weights
(neighbors: list, atom_forming_double_bond: md_harmonize.compound.Atom) → tuple[source]¶ To determine the heavy and light branches that connect to the atom forming the double bond. This is based on comparison of the atomic weights of the two branches (breadth first algorithm).
- Parameters
neighbors – the list of atom numbers of the atoms that connect the atom forming the double bond.
atom_forming_double_bond – the atom that forms the bond.
- Returns
heavy and light branches. [heavy_side, light_side]
-
get_next_layer_neighbors
(cur_layer_neighbors: list, visited: set, excluded: list = None) → list[source]¶ To get the next layer’s neighbors.
- Parameters
cur_layer_neighbors – the list of atom numbers of the current layer.
visited – the atom numbers that have already been visited.
excluded – the list of atom numbers that should not be included in the next layer.
- Returns
the neighboring atom numbers of the next layer.
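The layer-by-layer expansion can be sketched as one step of a breadth-first traversal. The `neighbors` adjacency dictionary and the choice to update `visited` in place are assumptions of this sketch, as is the function name.

```python
def next_layer_neighbors(neighbors: dict, cur_layer: list, visited: set, excluded=None) -> list:
    """Return unvisited, non-excluded neighbors of the current layer."""
    excluded = set(excluded or [])
    next_layer = []
    for atom in cur_layer:
        for neighbor in neighbors[atom]:
            if neighbor not in visited and neighbor not in excluded:
                visited.add(neighbor)  # assumption: visited is updated in place
                next_layer.append(neighbor)
    return next_layer
```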
-
color_compound
(r_groups: bool = True, bond_stereo: bool = False, atom_stereo: bool = False, resonance: bool = False, isotope_resolved: bool = False, charge: bool = False, backbone: bool = False) → None[source]¶ To color the compound.
- Parameters
r_groups – If true, add R groups in the coloring.
bond_stereo – If true, add bond stereo detail when constructing colors.
atom_stereo – If true, add atom stereo detail when constructing colors.
resonance – If true, ignore the difference between double and single bonds.
isotope_resolved – If true, add isotope detail when constructing colors.
charge – If true, add charge detail when constructing colors.
backbone – If true, ignore bond types in the coloring.
- Returns
None.
-
generate_atom_zero_layer_color
(isotope_resolved: bool = False, charge: bool = False, atom_stereo: bool = False) → None[source]¶ To generate the color identifier of zero layer for each atom. We don’t consider H and metals here.
- Parameters
isotope_resolved – If true, add isotope detail when constructing colors.
charge – If true, add charge detail when constructing colors.
atom_stereo – If true, add atom stereochemistry detail when constructing colors.
- Returns
None.
-
generate_atom_color_with_neighbors
(atom_index: list, excluded: list = None, zero_core_color: bool = True, zero_neighbor_color: bool = True, resonance: bool = False, bond_stereo: bool = False, backbone: bool = False) → dict[source]¶ To generate the atom color with its neighbors. We add this color name when we try to incorporate neighbors’ information in naming.
Here, we don’t need to care about the atom stereo. It has been taken care of in generating color_0.
Basic color formula: atom.color + [neighbor.color + bond.bond_type]
- Parameters
atom_index – indices of atoms to color.
excluded – the list of atom indices to be excluded from coloring.
zero_core_color – If true, we use the atom.color_0 else atom.color for the core atom (first round coloring vs validation).
zero_neighbor_color – If true, we use the atom.color_0 else atom.color for the neighbor atoms (first round coloring vs validation).
resonance – If true, detect resonant compound pairs without distinguishing between double and single bonds.
bond_stereo – If true, add stereo detail of bonds when constructing colors.
backbone – If true, ignore bond types in the coloring.
- Returns
the dictionary of atom index and its color name.
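The basic color formula, atom.color + [neighbor.color + bond.bond_type], can be illustrated with plain strings. The helper below is hypothetical; the parentheses and the sorting of neighbor parts (to make the result independent of neighbor order) are assumptions about formatting, not the library's exact output.

```python
def color_with_neighbors(core_color: str, neighbor_parts: list) -> str:
    """Combine a core atom color with its neighbors' colors and bond types.

    neighbor_parts: list of (neighbor_color, bond_type) pairs.
    """
    # Assumption: sorting gives a canonical, order-independent identifier.
    parts = sorted(f"{color}{bond_type}" for color, bond_type in neighbor_parts)
    return core_color + "".join(f"({part})" for part in parts)
```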
-
first_round_color
(atoms_to_color: list, excluded_index: list = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False, depth: int = 5000) → None[source]¶ To do the first round of coloring this compound. We add neighbors’ information layer by layer to the atom’s color identifier until it has a unique identifier or all the atoms in the compound have been used for naming (based on the breadth first search algorithm).
- Parameters
atoms_to_color – the list of atom numbers to be colored.
excluded_index – the list of atom numbers to be excluded from coloring.
bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore the difference between double and single bonds.
backbone – If true, ignore bond types in the coloring.
depth – the max depth of coloring.
- Returns
None.
-
invalid_symmetric_atoms
(atoms_to_color: list, excluded_index: bool = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False) → list[source]¶ To check if atoms with the same color identifier are symmetric.
- Parameters
atoms_to_color – the list of atom numbers to be colored.
excluded_index – the list of atom numbers to be excluded from coloring.
bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore the difference between double and single bonds.
backbone – If true, ignore bond types in the coloring.
- Returns
the list of atom numbers to be recolored.
-
curate_invalid_symmetric_atoms
(atoms_to_color: list, excluded_index: list = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False) → None[source]¶ To curate the color identifiers of invalid symmetric atoms. We recolor those invalid atoms using the full color identifiers of their neighbors layer by layer until the difference can be captured.
- Parameters
atoms_to_color – the list of atom numbers of atoms to be colored.
excluded_index – the list of atom numbers of atoms to be excluded from coloring.
bond_stereo – If true, add stereo information to bonds when constructing colors.
resonance – If true, ignore the difference between double bonds and single bonds.
backbone – If true, ignore bond types in the coloring.
- Returns
None.
-
color_metal
(bond_stereo: bool = False, resonance: bool = True, backbone: bool = False) → None[source]¶ To color the metals in the compound. Here we just incorporate information of directly connected atoms.
- Parameters
bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore difference between double and single bonds.
backbone – If true, ignore the bond types.
- Returns
None.
-
color_h
(bond_stereo: bool = False, resonance: bool = True, backbone: bool = False) → None[source]¶ To color the H in the compound. Here we just incorporate information of directly connected atoms.
- Parameters
bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore difference between double and single bonds.
backbone – If true, ignore bond types.
- Returns
None.
-
metal_color_identifier
(details: bool = True) → str[source]¶ To generate the metal coloring string representation.
- Parameters
details – if true, use the full metal color when constructing identifier.
- Returns
the metal coloring string representation.
-
h_color_identifier
(details: bool = True) → str[source]¶ To generate the H coloring string representation.
- Parameters
details – if true, use the full H color when constructing identifier.
- Returns
the H coloring string representation.
-
backbone_color_identifier
(r_groups: bool = False) → str[source]¶ To generate the backbone coloring string representation for this compound. Exclude Hs and metals.
- Parameters
r_groups – whether to include the R group.
- Returns
the coloring string representation for this compound.
-
get_chemical_details
(excluded: list = None) → list[source]¶ To get the chemical details of the compound, which include the atom stereochemistry and bond stereochemistry. This is used to compare compounds with the same structure (or the same color identifiers).
- Parameters
excluded – a list of atom indices to be ignored.
- Returns
the list of chemical details in the compound.
-
static
compare_chemical_details
(one_chemical_details: list, the_other_chemical_details: list) → tuple[source]¶ To compare the chemical details of the two compounds.
Then return the relationship between the two compounds.
The relationship can be equivalent, generic-specific, or loose, represented by 0, (-1, 1), and 2, respectively.
- Parameters
one_chemical_details – the chemical details of one compound.
the_other_chemical_details – the chemical details of the other compound.
- Returns
the relationship between the two structures and the count of chemical details that cannot be mapped.
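One way to picture this comparison is as a multiset difference between the two lists of details. The sketch below is illustrative only: the function name, the requirement that details are hashable (for example tuples), and the sign convention (1 when the first compound carries no extra details and is therefore the more generic one) are assumptions, not the documented behavior.

```python
from collections import Counter


def compare_details(one_details: list, other_details: list) -> tuple:
    """Return (relationship, unmapped_count) for two lists of hashable details."""
    one_only = Counter(one_details) - Counter(other_details)
    other_only = Counter(other_details) - Counter(one_details)
    unmapped = sum(one_only.values()) + sum(other_only.values())
    if not one_only and not other_only:
        return 0, unmapped   # equivalent
    if not one_only:
        return 1, unmapped   # assumption: first compound is more generic
    if not other_only:
        return -1, unmapped  # assumption: second compound is more generic
    return 2, unmapped       # loose: each side has details the other lacks
```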
-
same_structure_relationship
(the_other_compound) → tuple[source]¶ To determine the relationship of two compounds with the same structure.
- Parameters
the_other_compound – the other
Compound
entity.
- Returns
the relationship and the atom mappings between the two compounds.
-
generate_atom_mapping_by_atom_color
(the_other_compound) → dict[source]¶ To generate the atom mappings between the two compounds.
Assume the two compounds have the same structure, so we can achieve atom mappings through atom colors.
- Parameters
the_other_compound – the other
Compound
entity.
- Returns
the atom mappings between the two compounds.
-
optimal_resonant_mapping
(the_other_compound, mappings: list) → tuple[source]¶ To find the optimal atom mappings for compound pairs that are resonant type.
- Parameters
the_other_compound – the other
Compound
entity.
mappings – the list of atom mappings between the two compounds detected by BASS.
- Returns
the relationship and the atom mappings between the two compounds.
-
static
determine_relationship
(unmapped_count: dict) → int[source]¶ To determine the relationship between two compounds when there are multiple possible atom mappings.
We try to map as many details as possible.
0: equivalent; 1: self is more generic than the other compound; -1: the other compound is more generic than self; 2: either has chemical detail(s) that the other compound does not have.
- Parameters
unmapped_count – the dictionary of relationship to the count of details that cannot be mapped.
- Returns
the relationship between the two compounds.
-
circular_pair_relationship
(other_compound, seconds: int = 50) → tuple[source]¶ To determine the relationship of two compounds with interchangeable circular and linear representations with time limit.
- Parameters
other_compound – the other
Compound
entity.
seconds – the timeout limit.
- Returns
the relationship and the atom mappings between the two compounds.
-
circular_pair_relationship_helper
(other_compound) → tuple[source]¶ To determine the relationship of two compounds with interchangeable circular and linear representations. We first find the critical atoms that are involved in the formation of the ring. There can be several possibilities. Then we break the ring and restore the double bond in the aldehyde group that forms the ring. Finally, we check if the updated structure is the same as the other compound, determine the relationship between the two compounds, and generate the atom mappings.
- Parameters
other_compound – the other
Compound
entity.
- Returns
the relationship and the atom mappings between the two compounds.
-
break_cycle
(critical_atoms: int) → None[source]¶ To break the cycle caused by the aldol reaction, which often occurs in sugars. Two steps are involved: 1) remove the neighbors. 2) restore the double bond in the aldehyde group.
- Parameters
critical_atoms – the three critical atoms that are involved in the ring formation.
- Returns
None.
-
restore_cycle
(critical_atoms: list) → None[source]¶ To restore the ring caused by aldol reaction. The reverse process of break_cycle.
- Parameters
critical_atoms – the three atoms that are involved in the aldol reaction.
- Returns
None.
-
find_critical_atom_in_cycle
() → list[source]¶ To find the C (atom_c) and O (atom_oo) in the aldehyde group, as well as the O (atom_o) in the hydroxy group, that are involved in the ring formation. We need to break the bond between atom_c and atom_o to obtain the linear form. Please check an example of the aldol reaction in a sugar if the description is confusing.
- Returns
the list of critical atoms.
-
update_atom_symbol
(index: list, updated_symbol: str) → None[source]¶ To update the atom symbols. This is often used to remove/restore R group.
- Parameters
index – the indices of the atoms whose symbols will be updated.
updated_symbol – the updated symbol.
- Returns
None.
-
validate_mapping_with_r
(other_compound, one_rs: list, mapping: dict) → bool[source]¶ To validate the atom mappings with R groups. There are two things we need to pay attention to:
For the generic compound, an R group can be mapped to a branch or just H in the specific compound.
For the specific compound, every unmatched branch needs to correspond to an R group in the generic compound.
In other words, the generic compound can have extra R groups that have no matched branch, but the specific compound cannot have unmatched branches that don’t correspond to any R group.
The validation proceeds as follows:
1) We find all the linkages between an R group and a mapped atom in this compound, each represented by the corresponding atom number in the other compound and the bond type. (We use the corresponding atom number in the other compound so that the R linkages of the two compounds can be compared in the next step.)
2) For every mapped atom in the other compound, we check whether it has neighbors that are not mapped. If so, the atom should be linked to an R group. We represent the linkage by the atom number and the bond type.
3) Based on the above criteria, we make sure that the R linkages in the other compound are a subset of the R linkages in this compound.
- Parameters
other_compound – the other
Compound
entity.
one_rs – the R groups in the compound.
mapping – the atom mappings between the mapped parts of the two compounds.
- Returns
bool whether the atom mappings are valid.
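The final subset check in the validation can be sketched with linkages expressed as (atom number, bond type) pairs in the other compound's numbering. The function name is hypothetical, and comparing with multiplicity (so that two R groups on the same atom count twice) is an assumption of this sketch.

```python
from collections import Counter


def r_linkages_valid(this_r_linkages: list, other_r_linkages: list) -> bool:
    """True if every linkage required by the other compound exists in this compound.

    Each linkage is an (atom_number, bond_type) pair.
    """
    # Counter subtraction keeps only linkages the other compound needs
    # but this compound does not provide.
    missing = Counter(other_r_linkages) - Counter(this_r_linkages)
    return not missing
```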
-
compare_chemical_details_with_mapping
(other_compound, mapping: dict) → tuple[source]¶ To compare the chemical details of mapped atoms of the two compounds. This part targets compound pairs with resonance or r_group type. Only part of the chemical details needs to be checked: 1) atoms that are not involved in the resonance part or connected to R groups (both cases can be tested by the first layer atom coloring identifier); 2) bonds formed by the atoms described above.
- Parameters
other_compound – the other
Compound
entity.
mapping – the mapped atoms between the two compounds.
- Returns
the count of chemical details that cannot be mapped.
-
optimal_mapping_with_r
(other_compound, one_rs: list, mappings: list) → tuple[source]¶ To find the optimal mappings of compound pairs belonging to r_group type. In this case, multiple valid mappings can exist. We need to find the optimal one with the minimal unmapped chemical details. And the unmapped chemical details can exist in both compounds (generic or specific). The unmapped chemical details will determine the relationship of the compound pair. The priority: generic-specific, loose. The relationship cannot be equivalent.
- Parameters
other_compound – the other
Compound
entity.
one_rs – the list of R groups in the compound.
mappings – the atom mappings of the mapped parts in the two compounds.
- Returns
the relationship and atom mappings between the two compounds.
-
with_r_pair_relationship
(other_compound, seconds: int = 50) → tuple[source]¶ To find the relationship and the atom mappings between the two compounds that have r_groups type with a time limit.
- Parameters
other_compound – the other
Compound
entity.
seconds – the timeout limit.
- Returns
the relationship and the atom mappings between the two compounds.
-
with_r_pair_relationship_helper
(other_compound) → tuple[source]¶ To find the relationship and the atom mappings between the two compounds that have r_groups type. Several steps are involved:
1) Ignore the R groups in the two compounds and find if one compound (generic compound) is included in the other compound (specific compound).
2) If we can find the mappings, then we need to validate the mappings with the validate_mapping_with_r function.
3) Then we get the optimal atom mappings of the mapped parts.
4) We need to map the unmatched branches in the specific compound to the corresponding R group in the generic compound.
- Parameters
other_compound – the other
Compound
entity.
- Returns
the relationship and the atom mappings between the two compounds.
-
map_r_correspondents
(one_rs: list, other_compound, mappings: dict) → dict[source]¶ To map the unmatched branches in the specific compound to the corresponding R group in the generic compound.
- Parameters
one_rs – the list of R groups in the compound.
other_compound – the other
Compound
entity.
mappings – the atom mappings of the mapped parts in the two compounds.
- Returns
the full atom mappings between the two compounds.
md_harmonize.reaction¶
This module provides the Reaction
class entity.
-
class
md_harmonize.reaction.
Reaction
(reaction_name: str, one_side: list, other_side: list, ecs: dict, atom_mappings: list, coefficients: dict)[source]¶ Reaction class describes the
Reaction
entity.
Reaction initializer.
- Parameters
reaction_name – the reaction name.
one_side – the list of
Compound
entities in one side of the reaction.
other_side – the list of
Compound
entities in the other side of the reaction.
ecs – the dict of Enzyme Commission numbers (EC numbers) of the reaction.
atom_mappings – the list of atom mappings between two sides of the reaction.
coefficients – the dictionary of compound names and their corresponding coefficients in the reaction.
-
__init__
(reaction_name: str, one_side: list, other_side: list, ecs: dict, atom_mappings: list, coefficients: dict) → None[source]¶ Reaction initializer.
- Parameters
reaction_name – the reaction name.
one_side – the list of
Compound
entities in one side of the reaction.
other_side – the list of
Compound
entities in the other side of the reaction.
ecs – the dict of Enzyme Commission numbers (EC numbers) of the reaction.
atom_mappings – the list of atom mappings between two sides of the reaction.
coefficients – the dictionary of compound names and their corresponding coefficients in the reaction.
-
property
name
¶ To get the reaction name.
- Returns
the reaction name.
md_harmonize.KEGG_database_scraper¶
This module provides functions to download KEGG data (including compound, reaction, kcf, and rclass) from the KEGG (REST) API.
The URLs can change.
-
md_harmonize.KEGG_database_scraper.
entry_list
(target_url: str) → list[source]¶ To get the list of entry names to download.
- Parameters
target_url – the url to fetch.
- Returns
the list of entry names.
-
md_harmonize.KEGG_database_scraper.
update_entity
(entries: list, sub_directory: str, directory: str, suffix: str = '') → None[source]¶ To download the KEGG entity (compound, reaction, or rclass) and save it into a file.
- Parameters
entries – the list of entry names to download.
sub_directory – the subdirectory to save the downloaded file.
directory – the main directory to save the downloaded file.
suffix – the suffix needed for download, like the mol for compound molfile and kcf for compound kcf file.
- Returns
None.
md_harmonize.KEGG_parser¶
This module provides functions to parse KEGG data (including compound, reaction, kcf, and rclass).
-
md_harmonize.KEGG_parser.
kegg_data_parser
(data: list) → dict[source]¶ This is to parse KEGG data (reaction, rclass, compound) file to a dictionary.
eg:
ENTRY R00259 Reaction
NAME acetyl-CoA:L-glutamate N-acetyltransferase
DEFINITION Acetyl-CoA + L-Glutamate <=> CoA + N-Acetyl-L-glutamate
EQUATION C00024 + C00025 <=> C00010 + C00624
RCLASS RC00004 C00010_C00024
RC00064 C00025_C00624
ENZYME 2.3.1.1
PATHWAY rn00220 Arginine biosynthesis
rn01100 Metabolic pathways
rn01110 Biosynthesis of secondary metabolites
rn01210 2-Oxocarboxylic acid metabolism
rn01230 Biosynthesis of amino acids
MODULE M00028 Ornithine biosynthesis, glutamate => ornithine
M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine
ORTHOLOGY K00618 amino-acid N-acetyltransferase [EC:2.3.1.1]
K00619 amino-acid N-acetyltransferase [EC:2.3.1.1]
K00620 glutamate N-acetyltransferase / amino-acid N-acetyltransferase [EC:2.3.1.35 2.3.1.1]
K11067 N-acetylglutamate synthase [EC:2.3.1.1]
K14681 argininosuccinate lyase / amino-acid N-acetyltransferase [EC:4.3.2.1 2.3.1.1]
K14682 amino-acid N-acetyltransferase [EC:2.3.1.1]
K22476 N-acetylglutamate synthase [EC:2.3.1.1]
K22477 N-acetylglutamate synthase [EC:2.3.1.1]
K22478 bifunctional N-acetylglutamate synthase/kinase [EC:2.3.1.1 2.7.2.8]
DBLINKS RHEA: 24295
///
- Parameters
data – the KEGG reaction description.
- Returns
the dictionary of parsed KEGG data.
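A minimal sketch of the parsing idea, not the package's implementation. It assumes the usual KEGG flat-file layout, in which the first 12 columns hold the section keyword and continuation lines leave those columns blank. The names parse_kegg_flat_sketch and field are hypothetical.

```python
from collections import defaultdict


def parse_kegg_flat_sketch(lines):
    """Group the values of one KEGG flat-file entry by section keyword."""
    parsed = defaultdict(list)
    key = None
    for line in lines:
        if line.startswith("///"):  # end-of-entry marker
            break
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:  # a new section starts; blank keyword means continuation
            key = keyword
        if key and value:
            parsed[key].append(value)
    return dict(parsed)


def field(keyword, value):
    """Hypothetical helper: pad the keyword to the assumed 12-column width."""
    return keyword.ljust(12) + value


example = [
    field("ENTRY", "R00259 Reaction"),
    field("EQUATION", "C00024 + C00025 <=> C00010 + C00624"),
    field("RCLASS", "RC00004 C00010_C00024"),
    field("", "RC00064 C00025_C00624"),
    field("ENZYME", "2.3.1.1"),
    "///",
]
record = parse_kegg_flat_sketch(example)
print(record["RCLASS"])
```

Keeping every section as a list of lines makes multi-line sections such as RCLASS or PATHWAY easy to handle without special cases.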
-
md_harmonize.KEGG_parser.
parse_equation
(equation: str) → tuple[source]¶ This is to parse the KEGG reaction equation.
eg: C00029 + C00001 + 2 C00003 <=> C00167 + 2 C00004 + 2 C00080
- Parameters
equation – the equation string.
- Returns
the parsed KEGG reaction equation.
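The structure of the returned tuple is not documented here, so the following is only a sketch of one plausible layout: a dictionary of compound name to coefficient for each side. The name parse_equation_sketch is hypothetical, and coefficients are kept as strings because KEGG equations can contain symbolic coefficients such as n.

```python
def parse_equation_sketch(equation):
    """Split a KEGG equation into two {compound: coefficient} dictionaries."""

    def parse_side(side):
        coefficients = {}
        for term in side.split(" + "):
            tokens = term.split()
            if len(tokens) == 2:  # e.g. "2 C00003"
                coefficient, compound = tokens
            else:  # a bare compound name has an implicit coefficient of 1
                coefficient, compound = "1", tokens[0]
            coefficients[compound] = coefficient
        return coefficients

    left, right = equation.split("<=>")
    return parse_side(left), parse_side(right)


one_side, other_side = parse_equation_sketch(
    "C00029 + C00001 + 2 C00003 <=> C00167 + 2 C00004 + 2 C00080"
)
print(one_side)
print(other_side)
```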
-
md_harmonize.KEGG_parser.
kegg_kcf_parser
(kcf: list) → dict[source]¶ This is to parse KEGG kcf file to a dictionary.
eg:
ENTRY C00013 Compound
ATOM 9
1 P1b P 22.2269 -20.0662
2 O2c O 23.5190 -20.0779
3 O1c O 21.0165 -20.0779
4 O1c O 22.2851 -21.4754
5 O1c O 22.2617 -18.4642
6 P1b P 24.8933 -20.0837
7 O1c O 24.9401 -21.4811
8 O1c O 26.1797 -20.0662
9 O1c O 24.9107 -18.4582
BOND 8
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 2
5 2 6 1
6 6 7 1
7 6 8 1
8 6 9 2
///
- Parameters
kcf – the kcf text.
- Returns
the dictionary of parsed kcf file.
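A sketch of how the ATOM and BOND sections could be read, assuming the column order shown in the example above (atom number, KEGG atom type, element symbol, x, y for atoms; bond number, two atom numbers, bond order for bonds). The function name, the output layout, and the two-atom fragment are hypothetical, and the actual parser may keep more fields.

```python
def parse_kcf_sketch(lines):
    """Collect the entry name, atoms and bonds from KCF text lines."""
    parsed = {"ENTRY": None, "ATOM": [], "BOND": []}
    section = None
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "///":
            continue
        if tokens[0] in parsed:  # section header: ENTRY, ATOM or BOND
            section = tokens[0]
            if section == "ENTRY":
                parsed["ENTRY"] = tokens[1]
            continue
        if section == "ATOM":
            number, kat, symbol, x, y = tokens[:5]
            parsed["ATOM"].append(
                {"number": int(number), "kat": kat, "symbol": symbol,
                 "x": float(x), "y": float(y)}
            )
        elif section == "BOND":
            _, first, second, order = tokens[:4]
            parsed["BOND"].append((int(first), int(second), int(order)))
    return parsed


# Hypothetical two-atom fragment modelled on the C00013 example.
fragment = [
    "ENTRY C00013 Compound",
    "ATOM 2",
    "1 P1b P 22.2269 -20.0662",
    "2 O2c O 23.5190 -20.0779",
    "BOND 1",
    "1 1 2 1",
    "///",
]
kcf = parse_kcf_sketch(fragment)
print(kcf["BOND"])
```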
-
class
md_harmonize.KEGG_parser.
reaction_center
(i, kat, label, match, difference)¶ Create new instance of reaction_center(i, kat, label, match, difference)
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
static
__new__
(_cls, i, kat, label, match, difference)¶ Create new instance of reaction_center(i, kat, label, match, difference)
-
__repr__
()¶ Return a nicely formatted representation string
-
difference
¶ Alias for field number 4
-
i
¶ Alias for field number 0
-
kat
¶ Alias for field number 1
-
label
¶ Alias for field number 2
-
match
¶ Alias for field number 3
-
-
class
md_harmonize.KEGG_parser.
RpairParser
(rclass_name: str, rclass_definitions: list, one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound)[source]¶ This is to get one-to-one atom mappings between two compounds based on the rclass definition.
Several steps are involved in this process:
1. The rclass definition can have several pieces. Each piece describes a center atom (R) and its connected atoms. The connected atoms can stay the same (M) or change (D) between the two compound structures.
2. First, we need to find the center atoms based on the rclass descriptions.
3. For each center atom, there can be multiple candidates. In other words, several atoms in the compound can meet the RDM description (symmetric compounds are one simple case).
4. Therefore, we need to generate all the combinations of the center atoms in a compound.
eg: if there are three atom centers, each center has several candidates:
center 1: [0, 1, 2]; center 2: [5, 6]; center 3: [10, 11]
The combinations for the center atoms:
[0, 5, 10], [0, 5, 11], [0, 6, 10], [0, 6, 11], [1, 5, 10], [1, 5, 11], [1, 6, 10], [1, 6, 11], [2, 5, 10], [2, 5, 11], [2, 6, 10], [2, 6, 11]
5. Next, we need to find the one-to-one atom mappings between the two compounds based on the mapped center atoms.
6. To do this, we first disassemble each compound into different components. This is necessary because of the difference atoms in the two compounds, i.e. broken bonds.
7. Then we find the mappings between each pair of disassembled components, and concatenate the mappings of all the components.
8. To find the one-to-one atom mappings, we use the BASS algorithm. We assume the mapped components have the same structure since we have already removed the different parts. However, here we only map the backbone of the structure (in other words, we simplify all the bond types to 1) because bonds can change (double bond to single bond, or triple bond to single bond).
9. To ensure the optimal mappings, we count the mapped atoms with changed local environment and choose the mapping with minimal changes.
RpairParser initializer.
- Parameters
rclass_name – the rclass name.
rclass_definitions – a list of rclass definitions.
one_compound – one compound involved in the pair.
other_compound – the other compound involved in the pair.
-
__init__
(rclass_name: str, rclass_definitions: list, one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound)[source]¶ RpairParser initializer.
- Parameters
rclass_name – the rclass name.
rclass_definitions – a list of rclass definitions.
one_compound – one compound involved in the pair.
other_compound – the other compound involved in the pair.
-
map_atom_by_colors
() → dict[source]¶ Roughly map the atoms between the two compounds by the atom color.
- Returns
the dict of mapped atom index between the two compounds.
-
map_whole_compound
() → dict[source]¶ Map two compounds if the two compounds can be roughly mapped by the atom color.
- Returns
the dict of mapped atom in the two compounds.
-
static
generate_kat_neighbors
(this_compound: md_harmonize.compound.Compound) → list[source]¶ Generate the atom neighbors represented by KEGG atom type for each atom in the compound. This is used to find the center atom. We use KEGG atom types since the rclass definitions describe atoms by their KEGG atom type.
- Parameters
this_compound – the compound entity.
- Returns
the list of atom with its neighbors.
-
static
find_target_atom
(atoms: list, target: tuple) → list[source]¶ Find the target atoms from a list of candidate atoms.
- Parameters
atoms – a list of atoms to search from.
target – the target atom to be searched.
- Returns
the list of atom numbers that match the target atom.
-
static
create_reaction_centers
(i: int, kat: str, difference: list, the_other_difference: list, match: list, the_other_match: list) → collections.namedtuple[source]¶ Create the center atom based on its connected atoms and its counterpart atom in the other compound.
- Parameters
i – the ith rclass definition.
kat – the KEGG atom type of the center atom.
difference – the list of KEGG atom type of different connected atoms.
the_other_difference – the list of KEGG atom type of different connected atoms in the other compound.
match – the list of KEGG atom type of the matched connected atoms.
the_other_match – the list of KEGG atom type of the matched connected atoms in the other compound.
- Returns
the constructed reaction center.
-
find_center_atoms
() → tuple[source]¶ Example of rclass definition:
C8x-C8y:*-C1c:N5y+S2x-N5y+S2x
The RDM pattern is defined as KEGG atom type changes at the reaction center (R), the difference region (D), and the matched region (M) for each reactant pair. It characterizes chemical structure transformation patterns associated with enzymatic reactions.
- Returns
the list of reaction centers and the corresponding candidate atoms.
-
static
get_center_list
(center_atom_index: list) → list[source]¶ Generate all the combinations of reaction centers.
- Parameters
center_atom_index – list of atom index list for each reaction centers. eg: three reaction centers: [[0, 1, 2], [5, 6], [10, 11]].
- Returns
the list of combined reaction centers. eg: [[0, 5, 10], [0, 5, 11], [0, 6, 10], [0, 6, 11], [1, 5, 10], [1, 5, 11], [1, 6, 10], [1, 6, 11], [2, 5, 10], [2, 5, 11], [2, 6, 10], [2, 6, 11]]
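The combinations in the example above are a Cartesian product of the candidate lists. A short sketch using itertools.product from the standard library follows. The function name is hypothetical and the package may build the list differently.

```python
from itertools import product


def get_center_list_sketch(center_atom_index):
    """Return every way to pick one candidate atom per reaction center."""
    return [list(choice) for choice in product(*center_atom_index)]


combinations = get_center_list_sketch([[0, 1, 2], [5, 6], [10, 11]])
print(len(combinations))  # 3 * 2 * 2 = 12
print(combinations[:4])
```

The number of combinations is the product of the candidate counts, so it can grow quickly for highly symmetric compounds.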
-
static
remove_different_bonds
(this_compound: md_harmonize.compound.Compound, center_atom_numbers: list, reaction_centers: list) → list[source]¶ Remove the bonds connecting to different atoms. For each reaction center, multiple atoms can be the different atoms. We need to get all the combinations.
- Parameters
this_compound – the
Compound
entity.
center_atom_numbers – the list of atom numbers for center atom in the compound.
reaction_centers – the list of reaction center descriptions for the compound.
- Returns
the list of bonds (represented by the atom numbers in the bond) that needs to be removed based on the RDM descriptions.
-
generate_atom_mappings
() → list[source]¶ Generate the one-to-one atom mappings of the compound pair.
- Returns
the list of atom mappings.
-
static
detect_components
(this_compound: md_harmonize.compound.Compound, removed_bonds: list, center_atom_numbers: list) → list[source]¶ Detect all the components in the compound after removing some bonds. The basic idea is a breadth-first search.
- Parameters
this_compound – the
Compound
entity.
removed_bonds – the list of removed bonds (represented by the atom numbers in the bond) in the compound.
center_atom_numbers – the list of atom numbers of the center atoms in the compound.
- Returns
the list of components of the compound represented by a list of atom numbers.
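The breadth-first search idea can be sketched on a plain adjacency dictionary. This is an illustration only: the adjacency input stands in for the Compound entity, the function name is hypothetical, and the handling of center_atom_numbers is left out.

```python
from collections import deque


def detect_components_sketch(adjacency, removed_bonds):
    """Find connected components after ignoring the removed bonds.

    adjacency: {atom_number: [neighbor atom numbers]} (assumed input form).
    removed_bonds: iterable of (atom_number, atom_number) pairs.
    """
    removed = {frozenset(bond) for bond in removed_bonds}
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            atom = queue.popleft()
            component.append(atom)
            for neighbor in adjacency[atom]:
                bond = frozenset((atom, neighbor))
                if neighbor not in seen and bond not in removed:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# A four-atom chain 0-1-2-3 with the bond between atoms 1 and 2 removed.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
components = detect_components_sketch(chain, [(1, 2)])
print(components)
```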
-
static
pair_components
(left_components: list, right_components: list) → list[source]¶ The two compounds are divided into separate components due to the difference atoms. We need to pair each component in one compound to its counterpart component in the other compound. Here we roughly pair the components based on the number of atoms in the component. Therefore, every component in one compound can be paired with several components in the other compound.
- Parameters
left_components – the components in one compound.
right_components – the components in the other compound.
- Returns
the list of paired components.
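A sketch of pairing by atom count, under the assumption that each component is a list of atom numbers and that a pair is a tuple of one left and one right component. The function name and the return layout are hypothetical.

```python
def pair_components_sketch(left_components, right_components):
    """Pair every left component with each right component of equal size."""
    return [
        (left, right)
        for left in left_components
        for right in right_components
        if len(left) == len(right)
    ]


pairs = pair_components_sketch([[0, 1], [2, 3, 4]], [[7, 8], [5, 6], [9, 10, 11]])
print(pairs)
```

Because only the size is compared, one component can appear in several candidate pairs. The later mapping steps are what decide between them.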
-
static
construct_component
(this_compound: md_harmonize.compound.Compound, atom_numbers: list, removed_bonds: list) → md_harmonize.compound.Compound[source]¶ Construct a
Compound
entity for the component based on the atom index and removed bonds, facilitating the following atom mappings.
- Parameters
this_compound – the
Compound
entity.
atom_numbers – the list of atom numbers in the component.
removed_bonds – the list of removed bonds (represented by the atom numbers in the bond) in the compound.
- Returns
the constructed component compound.
-
static
preliminary_atom_mappings_check
(left_component: md_harmonize.compound.Compound, right_component: md_harmonize.compound.Compound) → bool[source]¶ Roughly evaluate if the atoms between the two components can be mapped. We check whether every atom color in the left component has its counterpart in the right component. Here, we only consider the backbone of the structure.
- Parameters
left_component – the component in one compound.
right_component – the component in the other compound.
- Returns
bool whether the atoms in the two components can be mapped.
-
map_components
(left_removed_bonds: list, right_removed_bonds: list, left_centers: list, right_centers: list) → dict[source]¶ Find optimal map for every component in the compound pair.
- Parameters
left_removed_bonds – the list of removed bonds in one compound.
right_removed_bonds – the list of removed bonds in the other compound.
left_centers – the list of atom numbers of the center atoms in the one compound.
right_centers – the list of atom numbers of the center atoms in the other compound.
- Returns
the atom mappings for the compound pair based on the removed bonds and center atoms.
-
combine_atom_mappings
(atom_mappings: list) → dict[source]¶ Combine the atom mappings of all the components. As mentioned in the pair_components function, every component can have several mappings. Here, we choose the optimal mapping with the lowest count of changed local atom identifiers, and make sure that each atom is mapped only once.
- Parameters
atom_mappings – the list of atom mappings for all the components.
- Returns
the atom mappings for the compound pair.
-
static
validate_component_atom_mappings
(left_centers: list, right_centers: list, component_atom_mappings: dict) → bool[source]¶ Check if the mapped atoms correspond to the mapped reaction center atoms.
- Parameters
left_centers – the list of center atom indices in the left compound.
right_centers – the list of center atom indices in the right compound.
component_atom_mappings – the one to one atom mappings of one component.
- Returns
bool whether the mappings are valid.
-
count_changed_atom_identifiers
(one_to_one_mappings: dict) → int[source]¶ Count the mapped atoms with changed local atom identifier. The different atoms (D in RCLASS definitions) can cause change of local environment, which can change the atom identifier.
- Parameters
one_to_one_mappings – the dictionary of atom mappings between the two compounds.
- Returns
the total number of mapped atoms with different local identifiers.
-
md_harmonize.KEGG_parser.
create_compound_kcf
(kcf_file: str) → Optional[md_harmonize.compound.Compound][source]¶ Construct compound entity based on the KEGG kcf file.
- Parameters
kcf_file – the path to the kcf file.
- Returns
the constructed compound entity.
-
md_harmonize.KEGG_parser.
create_reactions
(reaction_directory: str, atom_mappings: dict) → list[source]¶ Create KEGG
Reaction
entities.
- Parameters
reaction_directory – the directory that stores all the reaction files.
atom_mappings – the compound pair name and its atom mappings.
- Returns
the constructed
Reaction
entities.
-
md_harmonize.KEGG_parser.
compound_pair_mappings
(pair_component: tuple) → tuple[source]¶ Get the atom mappings between two compounds based on the rclass definitions.
- Parameters
pair_component – a tuple containing the rclass_name, rclass_definitions, one_compound and the_other_compound.
- Returns
the compound pair name and its atom mappings.
-
md_harmonize.KEGG_parser.
create_atom_mappings
(rclass_directory: str, compounds: dict, seconds: int = 1200) → dict[source]¶ Generate the atom mappings between compounds based on RCLASS definitions.
- Parameters
rclass_directory – the directory that stores the rclass files.
compounds – a dictionary of
Compound
entities.
seconds – the timeout limit.
- Returns
the atom mappings of compound pairs.
md_harmonize.MetaCyc_parser¶
This module provides functions to parse MetaCyc text data.
Note: All MetaCyc reactions atom_mappings are stored in a single text file.
-
md_harmonize.MetaCyc_parser.
reaction_side_parser
(reaction_side: str) → dict[source]¶ This is to parse FROM_SIDE or TO_SIDE in the reaction.
eg: FROM-SIDE - (CPD-9147 0 8) (OXYGEN-MOLECULE 9 10)
Information includes the compound name and the start and end atom indices in this compound used for atom mappings. The atoms follow the order in the compound molfile.
- Parameters
reaction_side – the text description of reaction side.
- Returns
the dictionary of compounds and the corresponding start and end atom index in the atom mappings.
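A sketch of how such a line could be parsed with a regular expression. The function name is hypothetical, and the output maps each compound name to a (start, end) tuple, which is only an assumed layout. A compound that occurs twice on one side would overwrite its earlier entry in this simple version.

```python
import re

# Matches groups such as "(CPD-9147 0 8)": a name followed by two integers.
SIDE_PATTERN = re.compile(r"\((\S+) (\d+) (\d+)\)")


def reaction_side_parser_sketch(reaction_side):
    """Map each compound name to its (start, end) atom indices."""
    return {
        name: (int(start), int(end))
        for name, start, end in SIDE_PATTERN.findall(reaction_side)
    }


side = reaction_side_parser_sketch(
    "FROM-SIDE - (CPD-9147 0 8) (OXYGEN-MOLECULE 9 10)"
)
print(side)
```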
-
md_harmonize.MetaCyc_parser.
generate_one_to_one_mappings
(from_side: dict, to_side: dict, indices: str) → list[source]¶ To generate the one-to-one atom mappings between the two sides of a metabolic reaction.
- Parameters
from_side – the dictionary of compounds with their corresponding start and end atom indices in the from_side.
to_side – the dictionary of compounds with their corresponding start and end atom indices in the to_side.
indices – the string representation of mapped atoms.
- Returns
the list of mapped atoms between the two sides (from_index, to_index).
-
md_harmonize.MetaCyc_parser.
atom_mappings_parser
(atom_mapping_text: list) → dict[source]¶ This is to parse the MetaCyc reaction with atom mappings.
eg:
REACTION - RXN-11981
NTH-ATOM-MAPPING - 1
MAPPING-TYPE - NO-HYDROGEN-ENCODING
FROM-SIDE - (CPD-12950 0 23) (WATER 24 24)
TO-SIDE - (CPD-12949 0 24)
INDICES - 0 1 2 3 5 4 7 6 9 10 11 13 12 14 15 16 17 8 18 19 21 20 22 24 23
note: the INDICES are atom mappings between two sides of the reaction. TO-SIDE[i] is mapped to FROM-SIDE[idx] for i, idx in enumerate(INDICES). Pay attention to the direction!
- Parameters
atom_mapping_text – the text descriptions of reactions with atom mappings.
- Returns
the dictionary of reactions with atom mappings.
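The direction noted above can be made concrete with a small sketch. It assumes each side is a dictionary of compound name to inclusive (start, end) atom indices, and it resolves every reaction-wide index to a compound name plus a local atom index. The names locate and map_sides_sketch and the output layout are hypothetical.

```python
def locate(side, index):
    """Translate a reaction-wide atom index into (compound, local index)."""
    for name, (start, end) in side.items():
        if start <= index <= end:  # the end index is inclusive
            return name, index - start
    raise ValueError(f"atom index {index} is not covered by this side")


def map_sides_sketch(from_side, to_side, indices):
    """Pair atoms so that TO-SIDE[i] maps to FROM-SIDE[INDICES[i]]."""
    mappings = []
    for to_index, from_index in enumerate(int(t) for t in indices.split()):
        mappings.append((locate(from_side, from_index), locate(to_side, to_index)))
    return mappings


from_side = {"CPD-12950": (0, 23), "WATER": (24, 24)}
to_side = {"CPD-12949": (0, 24)}
indices = "0 1 2 3 5 4 7 6 9 10 11 13 12 14 15 16 17 8 18 19 21 20 22 24 23"
mapped = map_sides_sketch(from_side, to_side, indices)
print(mapped[4])   # atom 5 of CPD-12950 pairs with atom 4 of CPD-12949
print(mapped[23])  # the single WATER atom pairs with atom 23 of CPD-12949
```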
-
md_harmonize.MetaCyc_parser.
reaction_parser
(reaction_text: list) → dict[source]¶ This is used to parse MetaCyc reaction.
eg:
UNIQUE-ID - RXN-13583
TYPES - Redox-Half-Reactions
ATOM-MAPPINGS - (:NO-HYDROGEN-ENCODING (1 0 2) (((WATER 0 0) (HYDROXYLAMINE 1 2)) ((NITRITE 0 2))))
CREDITS - SRI
CREDITS - caspi
IN-PATHWAY - HAONITRO-RXN
LEFT - NITRITE
^COMPARTMENT - CCO-IN
LEFT - PROTON
^COEFFICIENT - 5
^COMPARTMENT - CCO-IN
LEFT - E-
^COEFFICIENT - 4
ORPHAN? - :NO
PHYSIOLOGICALLY-RELEVANT? - T
REACTION-BALANCE-STATUS - :BALANCED
REACTION-DIRECTION - LEFT-TO-RIGHT
RIGHT - HYDROXYLAMINE
^COMPARTMENT - CCO-IN
RIGHT - WATER
^COMPARTMENT - CCO-IN
STD-REDUCTION-POTENTIAL - 0.1
//
- Parameters
reaction_text – the text descriptions of MetaCyc reactions.
- Returns
the dict of parsed MetaCyc reactions.
-
md_harmonize.MetaCyc_parser.
create_reactions
(reaction_file: str, atom_mapping_file: str) → list[source]¶ To create MetaCyc reaction entities.
- Parameters
reaction_file – the path to the reaction file.
atom_mapping_file – the path to the atom mapping file.
- Returns
the list of constructed
Reaction
entities.
md_harmonize.aromatics¶
This module provides the AromaticManager
class entity.
-
class
md_harmonize.aromatics.
AromaticManager
(aromatic_substructures: list = None)[source]¶ Two major functions are implemented in AromaticManager.
1. Extract aromatic substructures based on labelled aromatic atoms (mainly C and N) or Indigo-detected aromatic bonds. The first case only applies to KEGG compounds, and the second case applies to compounds from any database.
2. Detect the aromatic substructures in any given compound, and update the bond type of the detected aromatic bonds.
AromaticManager initializer.
- Parameters
aromatic_substructures – a list of aromatic substructures.
-
__init__
(aromatic_substructures: list = None) → None[source]¶ AromaticManager initializer.
- Parameters
aromatic_substructures – a list of aromatic substructures.
-
encode
() → list[source]¶ To encode the aromatic substructures in the aromatic manager. (Applying jsonpickle to the AromaticManager directly raises an error because the cythonized entities cannot be pickled.)
- Returns
the list of aromatic substructures.
-
static
decode
(aromatic_structures: list)[source]¶ Construct the AromaticManager based on the aromatic substructures.
- Parameters
aromatic_structures – the list of aromatic substructures.
- Returns
the constructed AromaticManager.
-
add_aromatic_substructures
(substructures: list) → None[source]¶ Add newly detected aromatic structures to the manager. Make sure no duplicates in the aromatic substructures.
- Parameters
substructures – a list of aromatic substructures.
- Returns
None.
-
kegg_aromatize
(kcf_cpd: md_harmonize.compound.Compound) → None[source]¶ Extract aromatic substructures based on KEGG atom type in KEGG compound parsed from KCF file, and add the newly detected aromatic substructures to the AromaticManager.
- Parameters
kcf_cpd – the KEGG compound entity derived from KCF file.
- Returns
None.
-
indigo_aromatize
(molfile: str) → None[source]¶ Extract aromatic substructures via Indigo, and add the newly detected aromatic substructures to the AromaticManager.
- Parameters
molfile – the path to the molfile.
- Returns
None.
-
indigo_aromatic_bonds
(molfile: str) → set[source]¶ Detect the aromatic bonds in the compound via Indigo method.
- Parameters
molfile – the path to the molfile.
- Returns
the set of aromatic bonds represented by first_atom_number and second_atom_number in the bond.
-
static
fuse_cycles
(cycles: list) → list[source]¶ To fuse the cycles with shared atoms.
- Parameters
cycles – the list of cycles represented by atom numbers.
- Returns
the list of cleaned cycles.
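A sketch of fusing cycles that share atoms, written with plain sets. It assumes that a single shared atom is enough to fuse two cycles, which the description above does not spell out. The function name is hypothetical.

```python
def fuse_cycles_sketch(cycles):
    """Merge cycles (lists of atom numbers) that share at least one atom."""
    fused = []
    for cycle in cycles:
        current = set(cycle)
        remaining = []
        for group in fused:
            if group & current:  # overlap: absorb the existing group
                current |= group
            else:
                remaining.append(group)
        remaining.append(current)
        fused = remaining
    return [sorted(group) for group in fused]


print(fuse_cycles_sketch([[1, 2, 3], [3, 4, 5], [7, 8, 9]]))
```

Absorbing every overlapping group on each step also handles chains of cycles, where a new cycle bridges two groups that were separate until then.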
-
detect_aromatic_substructures_timeout
(cpd: md_harmonize.compound.Compound) → None[source]¶ Detect the aromatic substructures in the compound and stop the search on timeout.
- Parameters
cpd – the
Compound
entity.
- Returns
None.
-
detect_aromatic_substructures
(cpd: md_harmonize.compound.Compound) → None[source]¶ Detect all the aromatic substructures in the cpd, and update the bond type of aromatic bonds.
- Parameters
cpd – the
Compound
entity.
- Returns
None.
-
static
construct_aromatic_entity
(cpd: md_harmonize.compound.Compound, aromatic_cycles: list) → list[source]¶ Construct the aromatic substructure entity based on the aromatic atoms. Here, we also include outside atoms that are connected to aromatic rings with double bonds.
- Parameters
cpd – the
Compound
entity.
aromatic_cycles – the list of aromatic cycles represented by atom numbers in the compound.
- Returns
the list of constructed aromatic substructures.
md_harmonize.harmonization¶
This module provides the HarmonizedEdge
class,
the HarmonizedCompoundEdge
class,
and the HarmonizedReactionEdge
class.
-
class
md_harmonize.harmonization.
HarmonizedEdge
(one_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], other_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], relationship: int, edge_type: Union[str, int], mappings: dict)[source]¶ The HarmonizedEdge representing compound or reaction pairs.
HarmonizedEdge initializer.
- Parameters
one_side – one side of the edge. This can be compound or reaction.
other_side – the other side of the edge. This can be compound or reaction.
relationship – equivalent, generic-specific, or loose.
edge_type – for compound edge, this represents resonance, linear-circular, r group, same structure; for reaction edge, this represents 3 level match or 4 level match.
mappings – for compound edge, the mappings refer to mapped atoms between compounds; for reaction edge, the mappings refer to mapped compounds between reaction.
-
__init__
(one_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], other_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], relationship: int, edge_type: Union[str, int], mappings: dict) → None[source]¶ HarmonizedEdge initializer.
- Parameters
one_side – one side of the edge. This can be compound or reaction.
other_side – the other side of the edge. This can be compound or reaction.
relationship – equivalent, generic-specific, or loose.
edge_type – for compound edge, this represents resonance, linear-circular, r group, same structure; for reaction edge, this represents 3 level match or 4 level match.
mappings – for compound edge, the mappings refer to mapped atoms between compounds; for reaction edge, the mappings refer to mapped compounds between reaction.
-
property
reversed_relationship
¶ Get the relationship between the other side and one side.
- Returns
the reversed relationship.
-
class
md_harmonize.harmonization.
HarmonizedCompoundEdge
(one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound, relationship: int, edge_type: str, atom_mappings: dict)[source]¶ The HarmonizedCompoundEdge representing compound pairs.
HarmonizedCompoundEdge initializer.
- Parameters
one_compound – one
Compound
entity in the compound pair.
other_compound – the other
Compound
entity in the compound pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two compounds.
edge_type – the edge type can be resonance, linear-circular, r group, or same structure.
atom_mappings – the atom mappings between the two compounds.
-
__init__
(one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound, relationship: int, edge_type: str, atom_mappings: dict) → None[source]¶ HarmonizedCompoundEdge initializer.
- Parameters
one_compound – one
Compound
entity in the compound pair.
other_compound – the other
Compound
entity in the compound pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two compounds.
edge_type – the edge type can be resonance, linear-circular, r group, or same structure.
atom_mappings – the atom mappings between the two compounds.
-
property
reversed_mappings
¶ Get the atom mappings from compound on the other side to compound on the one side.
- Returns
atom mappings between the other side compound to one side compound.
-
class
md_harmonize.harmonization.
HarmonizedReactionEdge
(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction, relationship: int, edge_type: int, compound_mappings: dict)[source]¶ The HarmonizedReactionEdge to represent reaction pairs.
HarmonizedReactionEdge initializer.
- Parameters
one_reaction – one
Reaction
entity in the reaction pair.
other_reaction – the other
Reaction
entity in the reaction pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two reactions.
edge_type – the reactions can be 3-level EC or 4-level EC paired.
compound_mappings – the dictionary of paired compounds in the reaction pair.
-
__init__
(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction, relationship: int, edge_type: int, compound_mappings: dict) → None[source]¶ HarmonizedReactionEdge initializer.
- Parameters
one_reaction – one
Reaction
entity in the reaction pair.
other_reaction – the other
Reaction
entity in the reaction pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two reactions.
edge_type – the reactions can be 3-level EC or 4-level EC paired.
compound_mappings – the dictionary of paired compounds in the reaction pair.
-
class
md_harmonize.harmonization.
HarmonizationManager
[source]¶ The HarmonizationManager is responsible for adding, removing, or searching harmonized edges.
HarmonizationManager initializer.
-
save_manager
() → list[source]¶ Save all the names of harmonized edges.
- Returns
the list of harmonized edges.
-
static
create_key
(name_1: str, name_2: str) → str[source]¶ Create the edge key. Each edge is represented by a unique key in the harmonized_edges dictionary.
- Parameters
name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.
- Returns
the key of the edge.
-
add_edge
(edge: Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge]) → bool[source]¶ Add this edge to the harmonized edges.
- Parameters
edge – the
HarmonizedEdge
entity.
- Returns
bool whether the edge is added successfully.
-
remove_edge
(edge: Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge]) → bool[source]¶ Remove this edge from the harmonized edges.
- Parameters
edge – the
HarmonizedEdge
entity.
- Returns
bool whether the edge is removed successfully.
-
search
(name_1: str, name_2: str) → Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge, None][source]¶ Search the edge based on the names of the two sides.
- Parameters
name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.
- Returns
edge if the edge exists or None.
-
-
class
md_harmonize.harmonization.
CompoundHarmonizationManager
[source]¶ The CompoundHarmonizationManager is responsible for adding, removing or searching
HarmonizedCompoundEdge
.
CompoundHarmonizationManager initializer.
-
static
find_compound
(compound_dict: list, compound_name: str) → Optional[md_harmonize.compound.Compound][source]¶ Find the
Compound
based on the compound name in the compound dict.
- Parameters
compound_dict – a list of compound dictionaries.
compound_name – the target compound name.
- Returns
the
Compound
.
-
static
create_manager
(compound_dict: list, compound_pairs: list)[source]¶ Create the
CompoundHarmonizationManager
based on the compound pairs.
- Parameters
compound_dict – the list of compound dictionaries.
compound_pairs – the list of compound pairs.
- Returns
-
add_edge
(edge: md_harmonize.harmonization.HarmonizedCompoundEdge) → bool[source]¶ Add a newly detected edge to the manager, and update the occurrences of compound in the harmonized edges. This is for calculating the jaccard index.
- Parameters
edge – the
HarmonizedCompoundEdge
entity.
- Returns
bool whether the edge is added successfully.
-
remove_edge
(edge: md_harmonize.harmonization.HarmonizedCompoundEdge) → bool[source]¶ Remove the edge from the manager, and update the occurrences of compound in the harmonized edges.
- Parameters
edge – the
HarmonizedCompoundEdge
entity.
- Returns
bool whether the edge is removed successfully.
-
has_visited
(name_1: str, name_2: str) → bool[source]¶ Check if the compound pair has been visited.
- Parameters
name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.
- Returns
bool if the pair has been visited.
-
static
-
class
md_harmonize.harmonization.
ReactionHarmonizationManager
(compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager)[source]¶ The ReactionHarmonizationManager is responsible for adding, removing or searching
HarmonizedReactionEdge
.
ReactionHarmonizationManager initializer.
- Parameters
compound_harmonization_manager – the
CompoundHarmonizationManager
entity for compound pairs management.
-
__init__
(compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager) → None[source]¶ ReactionHarmonizationManager initializer.
- Parameters
compound_harmonization_manager – the
CompoundHarmonizationManager
entity for compound pairs management.
-
static
compare_ecs
(one_ecs: dict, other_ecs: dict) → int[source]¶ Compare the EC numbers of two reactions.
- Parameters
one_ecs – the dict of EC numbers of one reaction.
other_ecs – the dict of EC numbers of the other reaction.
- Returns
the EC number level at which the two reactions can be matched.
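As an illustration of what a "match level" between EC numbers can mean, the sketch below counts how many leading fields two EC numbers share (an EC number has four dot-separated fields). It is only a sketch under assumptions: the helper name shared_ec_level is hypothetical, it takes plain lists of EC strings rather than the dicts that compare_ecs expects, and the layout of those dicts is not described here.

```python
def shared_ec_level(one_ecs, other_ecs):
    """Return the deepest EC level (0 to 4) shared by any pair of EC numbers.

    Hypothetical helper for illustration; not the md_harmonize implementation.
    """
    best = 0
    for one in one_ecs:
        for other in other_ecs:
            level = 0
            for a, b in zip(one.split("."), other.split(".")):
                # Stop at the first differing field or at an unspecified field ("-").
                if a != b or a == "-":
                    break
                level += 1
            best = max(best, level)
    return best
```

For example, under these assumptions 1.1.1.1 and 1.1.1.2 agree on three fields, so the call returns 3.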
-
static
determine_relationship
(relationships: list) → int[source]¶ Determine the relationship of the reaction pair based on the relationship of paired compounds.
- Parameters
relationships – the list of relationships of the compound pairs in the two reactions.
- Returns
the relationship between the two reactions.
-
harmonize_reaction
(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction) → None[source]¶ Test if two reactions can be harmonized.
-
compound_mappings
(one_compounds: list, other_compounds: list) → dict[source]¶ Get the mapped compounds in the two compound lists.
-
unmapped_compounds
(one_compounds: list, other_compounds: list, mappings: dict) → tuple[source]¶ Get the compounds that cannot be mapped. This can lead to new compound pairs.
-
match_unmapped_compounds
(one_side_left: list, other_side_left: list) → None[source]¶ Match the leftover (unmapped) compounds and add the valid compound pairs to the
CompoundHarmonizationManager
. Invalid compound pairs are also added to the
CompoundHarmonizationManager
to avoid redundant matching.
-
jaccard
(one_compounds: list, other_compounds: list, mappings: dict) → float[source]¶ Calculate the Jaccard index between the two lists of compounds.
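For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch of how it could be computed for two compound lists follows. It assumes that mappings maps a compound name from the first list to a set of matching names in the second list and that mapped compounds count as the intersection; the function name jaccard_sketch and that data layout are assumptions, not the library's code.

```python
def jaccard_sketch(one_compounds, other_compounds, mappings):
    """Illustrative Jaccard index over mapped compounds; not the md_harmonize code."""
    one_set = set(one_compounds)
    other_set = set(other_compounds)
    # Compounds on the first side that map to at least one compound on the other side.
    matched_one = {a for a in one_set if mappings.get(a, set()) & other_set}
    # Compounds on the other side reached by those mappings.
    matched_other = {b for a in matched_one for b in mappings[a] if b in other_set}
    # Treat each matched pair as one shared element.
    intersection = min(len(matched_one), len(matched_other))
    union = len(one_set) + len(other_set) - intersection
    return intersection / union if union else 0.0
```

With three compounds on one side, two on the other, and two mapped pairs, this sketch gives 2 / (3 + 2 - 2) = 2/3.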
-
one_to_one_compound_mappings
(mappings: dict) → Optional[tuple][source]¶ Find the one-to-one compound mappings between the two reactions. This step guards against rare cases in which a compound in one reaction can be mapped to two or more compounds in the other reaction.
- Parameters
mappings – the dictionary of compound mappings.
- Returns
a tuple of the relationships of the compound pairs and the dictionary of one-to-one compound mappings.
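One way to picture the one-to-one step is a small backtracking search that assigns each compound a distinct partner and gives up if no such assignment exists. The sketch below assumes mappings maps each compound name to a set of candidate partner names; the function name one_to_one_sketch is hypothetical, and unlike the documented method it returns only the assignment (or None), not the relationships.

```python
def one_to_one_sketch(mappings):
    """Return a dict pairing every key with a distinct candidate, or None.

    Hypothetical illustration; not the md_harmonize implementation.
    """
    keys = sorted(mappings)

    def assign(index, used, chosen):
        if index == len(keys):
            return dict(chosen)       # every compound has a distinct partner
        key = keys[index]
        for candidate in sorted(mappings[key]):
            if candidate in used:
                continue              # partner already taken by another compound
            used.add(candidate)
            chosen[key] = candidate
            result = assign(index + 1, used, chosen)
            if result is not None:
                return result
            # Undo the choice and try the next candidate.
            used.discard(candidate)
            del chosen[key]
        return None

    return assign(0, set(), {})
```

Sorting the keys and candidates only makes the result deterministic; it is a design choice of this sketch.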
-
md_harmonize.harmonization.
harmonize_compound_list
(compound_dict_list: list) → md_harmonize.harmonization.CompoundHarmonizationManager[source]¶ Harmonize compounds across different databases based on the compound coloring identifier.
- Parameters
compound_dict_list – the list of
Compound
dictionaries from different sources.
- Returns
the
CompoundHarmonizationManager
containing harmonized compound edges.
-
md_harmonize.harmonization.
harmonize_reaction_list
(reaction_lists: list, compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager) → md_harmonize.harmonization.ReactionHarmonizationManager[source]¶ Harmonize reactions across different sources based on the harmonized compounds. At the same time, this also harmonizes compound pairs of the resonance, linear-circular, and R group types.
- Parameters
reaction_lists – a list of
Reaction
lists from different sources.
compound_harmonization_manager – a
CompoundHarmonizationManager
containing harmonized compound pairs with the same structure.
- Returns