The md_harmonize API Reference

This package includes the following modules.

md_harmonize.compound

This module provides the Atom class, the Bond class, and the Compound class to construct a compound entity. Most of the instance variables of these three classes are based on CTFile fields.

class md_harmonize.compound.Atom(atom_symbol: str, atom_number: int, x: str = '0', y: str = '0', z: str = '0', mass_difference: str = '0', charge: str = '0', atom_stereo_parity: str = '0', hydrogen_count: str = '0', stereo_care_box: str = '0', valence: str = '0', h0designator: str = '0', atom_atom_mapping_number: str = '0', inversion_retention_flag: str = '0', exact_change_flag: str = '0', kat: str = '', in_cycle: bool = False)[source]

Atom class describes the Atom entity in the compound.

Atom initializer.

Parameters
  • atom_symbol – atom_symbol.

  • atom_number – atom_number.

  • x – the atom x coordinate.

  • y – the atom y coordinate.

  • z – the atom z coordinate.

  • mass_difference – difference from mass in periodic table.

  • charge – charge.

  • atom_stereo_parity – atom stereo parity.

  • hydrogen_count – hydrogen_count.

  • stereo_care_box – stereo_care_box.

  • valence – valence.

  • h0designator – h0designator (obsolete CTFile parameter).

  • atom_atom_mapping_number – atom_atom_mapping_number.

  • inversion_retention_flag – inversion_retention_flag.

  • exact_change_flag – exact_change_flag.

  • kat – KEGG atom type.

  • in_cycle – whether the atom is in cycle.

__init__(atom_symbol: str, atom_number: int, x: str = '0', y: str = '0', z: str = '0', mass_difference: str = '0', charge: str = '0', atom_stereo_parity: str = '0', hydrogen_count: str = '0', stereo_care_box: str = '0', valence: str = '0', h0designator: str = '0', atom_atom_mapping_number: str = '0', inversion_retention_flag: str = '0', exact_change_flag: str = '0', kat: str = '', in_cycle: bool = False)[source]

Atom initializer.

Parameters
  • atom_symbol – atom_symbol.

  • atom_number – atom_number.

  • x – the atom x coordinate.

  • y – the atom y coordinate.

  • z – the atom z coordinate.

  • mass_difference – difference from mass in periodic table.

  • charge – charge.

  • atom_stereo_parity – atom stereo parity.

  • hydrogen_count – hydrogen_count.

  • stereo_care_box – stereo_care_box.

  • valence – valence.

  • h0designator – h0designator (obsolete CTFile parameter).

  • atom_atom_mapping_number – atom_atom_mapping_number.

  • inversion_retention_flag – inversion_retention_flag.

  • exact_change_flag – exact_change_flag.

  • kat – KEGG atom type.

  • in_cycle – whether the atom is in cycle.

update_symbol(symbol: str) → str[source]

To update the atom symbol.

Parameters

symbol – the updated atom symbol.

Returns

the updated atom_symbol.

update_atom_number(index: int) → int[source]

To update the atom number.

Parameters

index – the updated atom number.

Returns

the updated atom number.

remove_neighbors(neighbors: list) → list[source]

To remove neighbors from the atom.

Parameters

neighbors – the list of neighbors that will be removed from this atom.

Returns

the updated list of neighbors.

add_neighbors(neighbors: list) → list[source]

To add neighbors to the atom.

Parameters

neighbors – the list of neighbors that will be added to this atom.

Returns

the updated list of neighbors.

update_stereochemistry(stereo: str) → str[source]

To update the atom stereochemistry.

Parameters

stereo – the updated atom stereochemistry.

Returns

the updated atom stereochemistry.

color_atom(isotope_resolved: bool = False, charge: bool = False, atom_stereo: bool = False) → str[source]

To generate the atom color of the zero layer.

Parameters
  • isotope_resolved – If true, add isotope information when constructing colors.

  • charge – If true, add charge information when constructing colors.

  • atom_stereo – If true, add atom stereochemistry information when constructing colors.

Returns

the atom color of the zero layer.

reset_color() → None[source]

Reset the atom color.

Returns

None.

update_kat(kat: str) → str[source]

To update the atom KEGG atom type.

Parameters

kat – the KEGG atom type for this atom.

Returns

the updated KEGG atom type.

update_cycle(cycle_status: bool) → bool[source]

To update the cycle status of the atom.

Parameters

cycle_status – whether the atom is in a cycle.

Returns

the updated cycle status.

clone()[source]

To clone the atom.

Returns

the cloned atom.

class md_harmonize.compound.Bond(first_atom_number: str, second_atom_number: str, bond_type: str, bond_stereo: str = '0', bond_topology: str = '0', reacting_center_status: str = '0')[source]

Bond class describes the Bond entity in the compound.

Bond initializer.

Parameters
  • first_atom_number – the index of the first atom forming this bond.

  • second_atom_number – the index of the second atom forming this bond.

  • bond_type – the bond type. (1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any)

  • bond_stereo – the bond stereo. (Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down; Double bonds: determined by x, y, z coordinates)

  • bond_topology – the bond topology. (0 = Either, 1 = Ring, 2 = Chain)

  • reacting_center_status – reacting center status.

__init__(first_atom_number: str, second_atom_number: str, bond_type: str, bond_stereo: str = '0', bond_topology: str = '0', reacting_center_status: str = '0')[source]

Bond initializer.

Parameters
  • first_atom_number – the index of the first atom forming this bond.

  • second_atom_number – the index of the second atom forming this bond.

  • bond_type – the bond type. (1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any)

  • bond_stereo – the bond stereo. (Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down; Double bonds: determined by x, y, z coordinates)

  • bond_topology – the bond topology. (0 = Either, 1 = Ring, 2 = Chain)

  • reacting_center_status – reacting center status.
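The numeric codes listed for bond_type and bond_topology can be kept in small lookup tables when reading or debugging bond records. The following is a minimal sketch built only from the code lists in this reference; `BOND_TYPES`, `BOND_TOPOLOGIES`, and `describe_bond` are hypothetical names and are not part of md_harmonize.

```python
# Hypothetical lookup tables built from the bond_type and bond_topology codes
# listed in this reference. Codes are strings, matching the Bond signature.
BOND_TYPES = {
    "1": "Single",
    "2": "Double",
    "3": "Triple",
    "4": "Aromatic",
    "5": "Single or Double",
    "6": "Single or Aromatic",
    "7": "Double or Aromatic",
    "8": "Any",
}
BOND_TOPOLOGIES = {"0": "Either", "1": "Ring", "2": "Chain"}


def describe_bond(bond_type: str, bond_topology: str = "0") -> str:
    """Return a human-readable description of a bond's type and topology."""
    return f"{BOND_TYPES[bond_type]} bond, topology: {BOND_TOPOLOGIES[bond_topology]}"


description = describe_bond("4", "1")
```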

update_bond_type(bond_type: str) → str[source]

To update the bond type.

Parameters

bond_type – the updated bond type.

Returns

the updated bond type.

update_stereochemistry(stereo: str) → str[source]

To update the bond stereochemistry.

Parameters

stereo – the updated bond stereochemistry.

Returns

the updated bond stereochemistry.

update_first_atom(index: int) → int[source]

To update the first atom number of the bond.

Parameters

index – the updated first atom number.

Returns

the updated first atom number.

update_second_atom(index: int) → int[source]

To update the second atom number of the bond.

Parameters

index – the updated second atom number.

Returns

the updated second atom number.

clone()[source]

To clone the bond.

Returns

the cloned bond.

class md_harmonize.compound.Compound(compound_name: str, atoms: list, bonds: list)[source]

Compound class describes the Compound entity.

Compound initializer.

Parameters
  • compound_name – the compound name.

  • atoms – a list of Atom entities in the compound.

  • bonds – a list of Bond entities in the compound.

__init__(compound_name: str, atoms: list, bonds: list) → None[source]

Compound initializer.

Parameters
  • compound_name – the compound name.

  • atoms – a list of Atom entities in the compound.

  • bonds – a list of Bond entities in the compound.

encode() → tuple[source]

To encode the compound as a tuple.

Returns

the tuple encoding of the compound.

property name

To get the compound name.

Returns

the compound name.

static molfile_name(molfile: str)[source]

Create the compound entity based on the molfile representation.

Parameters

molfile – the filename of the molfile.

Returns

the constructed compound entity.

property formula

To construct the formula of this compound (only consider heavy atoms).

Returns

string formula of the compound.

property composition

To get the atom symbols and bond types in the compound.

Returns

the atom and bond information of the compound.

property r_groups

To get all the R groups in the compound.

Returns

the list of indices of all the R groups.

contains_r_groups() → bool[source]

To check if the compound contains R group(s).

Returns

bool whether the compound contains R group.

has_isolated_atoms() → bool[source]

To check if the compound has atoms that have no connections to other atoms.

Returns

bool whether the compound has isolated atoms.

property metal_index

To get the metal elements in the compound.

Returns

a list of atom numbers of metal elements.

property h_index

To get all H in the compound.

Returns

a list of atom numbers corresponding to H.

property heavy_atoms

To get all the heavy atoms in the compound.

Returns

a list of atom numbers corresponding to heavy atoms.

property index_of_heavy_atoms

To map the atom number to index in the heavy atom list.

Returns

the dictionary of atom number to atom index of heavy atoms.

color_groups(excluded=None) → dict[source]

To update the compound color groups after coloring.

Returns

the dictionary mapping each atom color to the list of atom numbers with that color.

detect_abnormal_atom() → dict[source]

To find the atoms with invalid bond counts.

Returns

a list of atom numbers with invalid bond counts.

curate_invalid_n() → None[source]

To curate the charge of invalid N atoms.

Returns

None.

update_aromatic_bond_type(cycles: list) → None[source]

Update the aromatic bond types. Two cases: 1) change the bond in the aromatic ring to aromatic bond (bond type = 4); 2) change the double bond connecting to the aromatic ring to single bond.

Parameters

cycles – the list of cycles represented by aromatic atom index.

Returns

None.

extract_double_bond_connecting_cycle(atom_in_cycle: list) → list[source]

To extract the double bonds connecting to the atom in the aromatic cycles.

Parameters

atom_in_cycle – the list of aromatic cycles represented by aromatic atom index.

Returns

the list of outside double bond connecting to the atom in the aromatic cycles.

extract_aromatic_bonds(cycle: list) → list[source]

Extract the aromatic bonds based on the atoms in the cycle.

Parameters

cycle – the list of aromatic cycles represented by aromatic atom index.

Returns

the list of aromatic bonds.

separate_connected_components(bonds: Union[list, set]) → list[source]

This is used in constructing the aromatic substructures detected by the Indigo method. A compound can have several disjoint aromatic substructures. Here, we need to find the disjoint parts. The basic idea is union-find. We union atoms that are connected by a bond.

Parameters

bonds – the list of bonds, each represented by the atom numbers forming the bond.

Returns

a list of separate components, each represented by a list of the atom numbers in the component.
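The union-find idea described for this method can be sketched as follows. This is an illustration under stated assumptions, not the md_harmonize implementation: bonds are assumed to be pairs of atom numbers, and `separate_components` is a hypothetical standalone function.

```python
from collections import defaultdict


def separate_components(bonds):
    """Group atom numbers into disjoint components with union-find."""
    parent = {}

    def find(atom):
        # Follow parent links to the root, halving the path on the way.
        parent.setdefault(atom, atom)
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]
            atom = parent[atom]
        return atom

    for first, second in bonds:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_first] = root_second  # union the two sets

    groups = defaultdict(list)
    for atom in parent:
        groups[find(atom)].append(atom)
    return [sorted(members) for members in groups.values()]


# Two disjoint substructures: atoms 0-1-2 and atoms 5-6.
components = separate_components([(0, 1), (1, 2), (5, 6)])
```

Each bond merges the sets containing its two atoms, so atoms end up in the same component exactly when a chain of bonds connects them.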

connected_components() → dict[source]

Detect the connected components in the compound structure (using breadth first search). This handles cases where not all the atoms are connected together.

Returns

the dictionary of the connected components.
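A breadth first search over an adjacency list is one way to find such components. The sketch below assumes the structure is given as a dictionary from atom number to neighbor atom numbers and that components are keyed by a running index; both choices are assumptions for illustration, and the actual dictionary layout returned by the method may differ.

```python
from collections import deque


def connected_components(adjacency):
    """Return {component_index: [atom numbers]} using breadth first search."""
    seen = set()
    components = {}
    for start in adjacency:
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        members = []
        while queue:
            atom = queue.popleft()
            members.append(atom)
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components[len(components)] = members
    return components


# Atoms 0 and 1 are bonded; atom 2 is isolated.
result = connected_components({0: [1], 1: [0], 2: []})
```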

calculate_distance_to_r_groups() → None[source]

To calculate the distance of each atom to its nearest R group (using Dijkstra’s algorithm).

Returns

None.
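The distance calculation can be illustrated with a multi-source Dijkstra search that starts from every R group at once. This sketch assumes each bond has length 1 and returns the distances rather than storing them on the atoms; `distance_to_r_groups` is a hypothetical standalone function, not the library method.

```python
import heapq


def distance_to_r_groups(adjacency, r_groups):
    """Shortest bond-count distance from each atom to its nearest R group."""
    distance = {atom: float("inf") for atom in adjacency}
    heap = []
    for r_group in r_groups:
        distance[r_group] = 0
        heapq.heappush(heap, (0, r_group))
    while heap:
        dist, atom = heapq.heappop(heap)
        if dist > distance[atom]:
            continue  # stale heap entry
        for neighbor in adjacency[atom]:
            candidate = dist + 1  # assumption: every bond has length 1
            if candidate < distance[neighbor]:
                distance[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distance


# Chain R(0)-1-2-3 with a single R group at atom 0.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
distances = distance_to_r_groups(chain, [0])
```

Atoms that cannot reach any R group keep an infinite distance in this sketch.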

find_cycles(short_circuit: bool = False, cutoff: int = 40, seconds=50) → list[source]

To find the cycles in the compound.

Parameters
  • short_circuit – whether to take short path.

  • cutoff – limit of cycle length.

  • seconds – the timeout limit.

Returns

the list of cycles in the compound.

find_cycles_helper(short_circuit: bool = False, cutoff: int = 40) → list[source]

Executing function to find the cycles in the compound.

Parameters
  • short_circuit – whether to take short path.

  • cutoff – limit of cycle length.

Returns

the list of cycles in the compound.

structure_matrix(resonance: bool = False, backbone: bool = False) → numpy.ndarray[source]

To construct the graph structure matrix of this compound. matrix[i][j] = 0 means the two atoms are not directly connected. Any other integer represents the type of the bond connecting the two atoms.

Parameters
  • resonance – bool whether to ignore the difference between single and double bonds.

  • backbone – bool whether to ignore bond types. This is for parsing atoms mappings from KEGG RCLASS.

Returns

the constructed structure matrix for this compound.
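A structure matrix of this kind can be sketched with nested lists. The sketch makes several assumptions that are not confirmed by this reference: bonds are given as (i, j, bond_type) triples with integer bond types, resonance is modeled by recording double bonds as 1, and backbone is modeled by recording every bond as 1. The real method returns a numpy.ndarray and may encode these options differently.

```python
def structure_matrix(atom_count, bonds, resonance=False, backbone=False):
    """Build a symmetric matrix whose entries are bond types (0 = no bond)."""
    matrix = [[0] * atom_count for _ in range(atom_count)]
    for first, second, bond_type in bonds:
        value = bond_type
        if backbone:
            value = 1  # assumption: ignore bond types entirely
        elif resonance and bond_type == 2:
            value = 1  # assumption: treat double bonds like single bonds
        matrix[first][second] = value
        matrix[second][first] = value
    return matrix


# Three atoms: 0=1 (double bond) and 1-2 (single bond).
bonds = [(0, 1, 2), (1, 2, 1)]
plain = structure_matrix(3, bonds)
resonant = structure_matrix(3, bonds, resonance=True)
```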

property distance_matrix

To construct the distance matrix of the compound (using the Floyd-Warshall algorithm). distance[i][j] is the distance between atoms i and j.

Returns

the distance matrix of the compound.
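The Floyd-Warshall algorithm named here can be sketched on top of a structure matrix. The sketch assumes every bond counts as distance 1 and uses nested lists instead of a NumPy array; `floyd_warshall` is a hypothetical standalone function, not the property itself.

```python
def floyd_warshall(structure):
    """All-pairs shortest path lengths, counting each bond as distance 1."""
    size = len(structure)
    inf = float("inf")
    dist = [
        [0 if i == j else (1 if structure[i][j] else inf) for j in range(size)]
        for i in range(size)
    ]
    for k in range(size):          # allow paths through atom k
        for i in range(size):
            for j in range(size):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


# Chain 0-1-2: atoms 0 and 2 are two bonds apart.
distance = floyd_warshall([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```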

update_color_tuple(resonance: bool = False) → None[source]

To update the color tuple of the atoms in the compound. This color tuple includes information of its neighboring atoms and bonds. Here, we don’t need to consider backbone since this part was initially designed for aromatic substructure detection and only double and single bonds are considered.

Parameters

resonance – bool whether to ignore the difference between single and double bonds.

Returns

None.

find_mappings(the_other, resonance: bool = True, r_distance: bool = False, backbone: bool = False) → list[source]

Find the one to one atom mappings between two compounds using the BASS algorithm. The other compound is supposed to be contained in the self compound.

Parameters
  • the_other – the other compound entity to be mapped.

  • resonance – whether to ignore the difference between single and double bonds.

  • r_distance – whether to take account of the position of R groups.

  • backbone – whether to ignore the bond types.

Returns

the list of atom mappings in the heavy atom order.

find_mappings_reversed(the_other, resonance: bool = True, r_distance: bool = False, backbone: bool = False) → list[source]

Find the one to one atom mappings between two compounds using the BASS algorithm. The self compound is supposed to be contained in the other compound.

Parameters
  • the_other – the other compound entity to be mapped.

  • resonance – whether to ignore the difference between single and double bonds.

  • r_distance – whether to take account of the position of R groups.

  • backbone – whether to ignore the bond types.

Returns

the list of atom mappings in the heavy atom order.

map_resonance(the_other, r_distance: bool = False, seconds: int = 50) → list[source]

Check if the resonant mappings are valid between two compound structures.

Parameters
  • the_other – the other compound entity to be mapped.

  • r_distance – to take account of the position of R groups.

  • seconds – the timeout limit.

Returns

the list of valid atom mappings between the two compound structures.

map_resonance_helper(the_other, r_distance: bool = False) → list[source]

Check if the resonant mappings are valid between the two compound structures. If the mapped atoms don’t share the same local coloring identifier, we check if the difference is caused by the position of double bonds. Find the three atoms involved in the resonant structure and check if one of the atoms is not C.

          N (a)                N (a)
         /                    //
    (b) C    N (c)       (b) C    N (c)

In addition, the self compound is supposed to be more generic, which means it has fewer atoms. Therefore, atoms in the self compound can all be mapped to the other compound.

Parameters
  • the_other – the other compound entity to be mapped.

  • r_distance – to take account of position of R groups.

Returns

the list of valid atom mappings between the two compound structures.

find_double_bond_linked_atom(i: int) → int[source]

Find the atom that is doubly linked to the target atom i.

Parameters

i – the ith atom in the compound.

Returns

the index of the doubly linked atom.

define_bond_stereochemistry() → None[source]

Define the stereochemistry of double bonds in the compound.

Returns

None.

calculate_bond_stereochemistry(bond: md_harmonize.compound.Bond) → int[source]

Calculate the stereochemistry of the double bond based on its geometric properties. The line of the double bond divides the plane into two parts. Each atom forming the double bond normally has two branches. If the two branches are not the same, we call them the heavy side and the light side (the heavy side contains atoms with heavier atomic weights). We determine the bond stereochemistry by checking if the two heavy sides lie on the same part of the divided plane.

Parameters

bond – the bond entity.

Returns

the calculated bond stereochemistry.
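The geometric test described here, whether the two heavy sides lie on the same part of the plane, can be sketched with the sign of a 2D cross product. The function names and the use of plain (x, y) tuples are assumptions for illustration; the sketch only answers the same-side question and does not reproduce the integer stereochemistry codes the method returns.

```python
def side_of_line(start, end, point):
    """Return +1 or -1 for the side of the line start->end, 0 if on the line."""
    cross = (end[0] - start[0]) * (point[1] - start[1]) - (
        end[1] - start[1]
    ) * (point[0] - start[0])
    return (cross > 0) - (cross < 0)


def heavy_sides_on_same_part(start, end, heavy_first, heavy_second):
    """True if both heavy-side atoms lie on the same side of the double bond."""
    first_side = side_of_line(start, end, heavy_first)
    second_side = side_of_line(start, end, heavy_second)
    return first_side != 0 and first_side == second_side


# Double bond along the x axis from (0, 0) to (1, 0).
same = heavy_sides_on_same_part((0, 0), (1, 0), (-0.5, 1.0), (1.5, 1.0))
opposite = heavy_sides_on_same_part((0, 0), (1, 0), (-0.5, 1.0), (1.5, -1.0))
```

Unlike a slope and intercept comparison, the cross product also works when the double bond is drawn vertically.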

static calculate_y_coordinate(slope: float, b: float, atom: md_harmonize.compound.Atom) → float[source]

Calculate the y coordinate of the atom based on the linear function: y = slope * x + b

Parameters
  • slope – the slope of the targeted line.

  • b – the intercept of the targeted line.

  • atom – the atom entity.

Returns

the calculated y coordinate.
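As a small worked example of the linear function y = slope * x + b, the sketch below evaluates the line at an atom's x coordinate and compares the result with the atom's own y coordinate. `PointAtom` is a hypothetical stand-in that carries only numeric coordinates; it is not the Atom class documented above.

```python
from dataclasses import dataclass


@dataclass
class PointAtom:
    """Hypothetical stand-in for Atom that only carries 2D coordinates."""
    x: float
    y: float


def calculate_y_coordinate(slope, b, atom):
    """Evaluate y = slope * x + b at the atom's x coordinate."""
    return slope * atom.x + b


atom = PointAtom(x=2.0, y=7.5)
y_on_line = calculate_y_coordinate(slope=1.5, b=1.0, atom=atom)  # 1.5 * 2.0 + 1.0
above_line = atom.y > y_on_line  # the atom sits above the line at this x
```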

collect_atomic_weights_of_neighbors(neighbors: list) → list[source]

To collect the atomic weights of the current layer’s neighbors.

Parameters

neighbors – the list of atom numbers of neighbors.

Returns

the list of atomic weights for this layer’s neighbors.

compare_branch_weights(neighbors: list, atom_forming_double_bond: md_harmonize.compound.Atom) → tuple[source]

To determine the heavy and light branches that connect to the atom forming the double bond. This is based on comparison of the atomic weights of the two branches (breadth first algorithm).

Parameters
  • neighbors – the list of atom numbers of the atoms that connect the atom forming the double bond.

  • atom_forming_double_bond – the atom that forms the bond.

Returns

heavy and light branches. [heavy_side, light_side]

get_next_layer_neighbors(cur_layer_neighbors: list, visited: set, excluded: list = None) → list[source]

To get the next layer’s neighbors.

Parameters
  • cur_layer_neighbors – the list of atom numbers of the current layer.

  • visited – the atom numbers that have already been visited.

  • excluded – the list of atom numbers that should not be included in the next layer.

Returns

the neighboring atom numbers of the next layer.
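One breadth first step of this kind can be sketched as below. The adjacency dictionary argument and the decision to add newly found atoms to `visited` in place are assumptions made for illustration; the method itself works on the compound's own atoms and may handle `visited` differently.

```python
def next_layer_neighbors(adjacency, cur_layer_neighbors, visited, excluded=None):
    """Collect unvisited, non-excluded neighbors of the current layer."""
    excluded = set(excluded or [])
    next_layer = []
    for atom in cur_layer_neighbors:
        for neighbor in adjacency[atom]:
            if neighbor in visited or neighbor in excluded:
                continue
            visited.add(neighbor)  # assumption: mark atoms as soon as they are found
            next_layer.append(neighbor)
    return next_layer


# Star-shaped fragment: atom 0 is bonded to atoms 1, 2, and 3; atom 3 to atom 4.
adjacency = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
visited = {0}
layer_one = next_layer_neighbors(adjacency, [0], visited, excluded=[2])
layer_two = next_layer_neighbors(adjacency, layer_one, visited, excluded=[2])
```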

color_compound(r_groups: bool = True, bond_stereo: bool = False, atom_stereo: bool = False, resonance: bool = False, isotope_resolved: bool = False, charge: bool = False, backbone: bool = False) → None[source]

To color the compound.

Parameters
  • r_groups – If true, add R groups in the coloring.

  • bond_stereo – If true, add bond stereo detail when constructing colors.

  • atom_stereo – If true, add atom stereo detail when constructing colors.

  • resonance – If true, ignore the difference between double and single bonds.

  • isotope_resolved – If true, add isotope detail when constructing colors.

  • charge – If true, add charge detail when constructing colors.

  • backbone – If true, ignore bond types in the coloring.

Returns

None.

reset_color() → None[source]

To set the color of atoms in the compound to be empty.

Returns

None.

generate_atom_zero_layer_color(isotope_resolved: bool = False, charge: bool = False, atom_stereo: bool = False) → None[source]

To generate the color identifier of zero layer for each atom. We don’t consider H and metals here.

Parameters
  • isotope_resolved – If true, add isotope detail when constructing colors.

  • charge – If true, add charge detail when constructing colors.

  • atom_stereo – If true, add atom stereochemistry detail when constructing colors.

Returns

None.

generate_atom_color_with_neighbors(atom_index: list, excluded: list = None, zero_core_color: bool = True, zero_neighbor_color: bool = True, resonance: bool = False, bond_stereo: bool = False, backbone: bool = False) → dict[source]

To generate the atom color with its neighbors. We add this color name when we try to incorporate neighbors’ information in naming.

Here, we don’t need to care about the atom stereo. It has been taken care of in generating color_0.

Basic color formula: atom.color + [neighbor.color + bond.bond_type]

Parameters
  • atom_index – indices of atoms to color.

  • excluded – the list of atom indices to be excluded from coloring.

  • zero_core_color – If true, we use the atom.color_0 else atom.color for the core atom (first round coloring vs validation).

  • zero_neighbor_color – If true, we use the atom.color_0 else atom.color for the neighbor atoms (first round coloring vs validation).

  • resonance – If true, detect resonant compound pairs without distinguishing between double and single bonds.

  • bond_stereo – If true, add stereo detail of bonds when constructing colors.

  • backbone – If true, ignore bond types in the coloring.

Returns

the dictionary of atom index and its color name.

first_round_color(atoms_to_color: list, excluded_index: list = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False, depth: int = 5000) → None[source]

To do the first round of coloring this compound. We add neighbors’ information layer by layer to the atom’s color identifier until it has a unique identifier or all the atoms in the compound have been used for naming (based on the breadth first search algorithm).

Parameters
  • atoms_to_color – the list of atom numbers to be colored.

  • excluded_index – the list of atom numbers to be excluded from coloring.

  • bond_stereo – If true, add bond stereo detail when constructing colors.

  • resonance – If true, ignore the difference between double and single bonds.

  • backbone – If true, ignore bond types in the coloring.

  • depth – the max depth of coloring.

Returns

None.
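The layer-by-layer idea can be illustrated with a much simplified sketch. It assumes zero-layer colors are plain element symbols, ignores bond types and stereochemistry, and appends each new layer as a sorted, parenthesized group; the color format, the function name, and the dictionary inputs are all assumptions and do not reproduce the identifiers md_harmonize generates.

```python
from collections import Counter


def first_round_color_sketch(symbols, adjacency, depth=5000):
    """Extend each atom's color one neighbor layer at a time until unique."""
    colors = dict(symbols)
    visited = {atom: {atom} for atom in symbols}
    frontier = {atom: [atom] for atom in symbols}
    for _ in range(depth):
        counts = Counter(colors.values())
        # Atoms that still share a color and still have layers left to explore.
        pending = [a for a in symbols if counts[colors[a]] > 1 and frontier[a]]
        if not pending:
            break
        for atom in pending:
            layer = []
            for current in frontier[atom]:
                for neighbor in adjacency[current]:
                    if neighbor not in visited[atom]:
                        visited[atom].add(neighbor)
                        layer.append(neighbor)
            frontier[atom] = layer
            if layer:
                colors[atom] += "(" + "".join(sorted(symbols[n] for n in layer)) + ")"
    return colors


# Heavy atoms of ethanol: C(0)-C(1)-O(2).
ethanol = first_round_color_sketch({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
# Heavy atoms of propane: C(0)-C(1)-C(2); the two end carbons are symmetric.
propane = first_round_color_sketch({0: "C", 1: "C", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]})
```

In the propane case the two end carbons keep identical colors after all atoms have been used, which is the situation the symmetry checks documented below are meant to examine.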

invalid_symmetric_atoms(atoms_to_color: list, excluded_index: bool = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False) → list[source]

To check if atoms with the same color identifier are symmetric.

Parameters
  • atoms_to_color – the list of atom numbers to be colored.

  • excluded_index – the list of atom numbers to be excluded from coloring.

  • bond_stereo – If true, add bond stereo detail when constructing colors.

  • resonance – If true, ignore the difference between double and single bonds.

  • backbone – If true, ignore bond types in the coloring.

Returns

the list of atom numbers to be recolored.

curate_invalid_symmetric_atoms(atoms_to_color: list, excluded_index: list = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False) → None[source]

To curate the color identifiers of invalid symmetric atoms. We recolor those invalid atoms using the full color identifiers of their neighbors layer by layer until the difference can be captured.

Parameters
  • atoms_to_color – the list of atom numbers of atoms to be colored.

  • excluded_index – the list of atom numbers of atoms to be excluded from coloring.

  • bond_stereo – If true, add stereo information to bonds when constructing colors.

  • resonance – If true, ignore the difference between double bonds and single bonds.

  • backbone – If true, ignore bond types in the coloring.

Returns

None.

color_metal(bond_stereo: bool = False, resonance: bool = True, backbone: bool = False) → None[source]

To color the metals in the compound. Here we just incorporate information of directly connected atoms.

Parameters
  • bond_stereo – If true, add bond stereo detail when constructing colors.

  • resonance – If true, ignore difference between double and single bonds.

  • backbone – If true, ignore the bond types.

Returns

None.

color_h(bond_stereo: bool = False, resonance: bool = True, backbone: bool = False) → None[source]

To color the H in the compound. Here we just incorporate information of directly connected atoms.

Parameters
  • bond_stereo – If true, add bond stereo detail when constructing colors.

  • resonance – If true, ignore difference between double and single bonds.

  • backbone – If true, ignore bond types.

Returns

None.

metal_color_identifier(details: bool = True) → str[source]

To generate the metal coloring string representation.

Parameters

details – if true, to use full metal color when constructing identifier.

Returns

the metal coloring string representation.

h_color_identifier(details: bool = True) → str[source]

To generate the H coloring string representation.

Parameters

details – if true, use the full H color when constructing identifier.

Returns

the H coloring string representation.

backbone_color_identifier(r_groups: bool = False) → str[source]

To generate the backbone coloring string representation for this compound. Exclude Hs and metals.

Parameters

r_groups – whether to include the R group.

Returns

the coloring string representation for this compound.

get_chemical_details(excluded: list = None) → list[source]

To get the chemical details of the compound, which include the atom stereochemistry and bond stereochemistry. This is used to compare compounds with the same structure (or the same color identifiers).

Parameters

excluded – a list of atom indices to be ignored.

Returns

the list of chemical details in the compound.

static compare_chemical_details(one_chemical_details: list, the_other_chemical_details: list) → tuple[source]

To compare the chemical details of the two compounds.

Then return the relationship between the two compounds.

The relationship can be equivalent, generic-specific, or loose, represented by 0, (-1, 1), and 2, respectively.

Parameters
  • one_chemical_details – the chemical details of one compound.

  • the_other_chemical_details – the chemical details of the other compound.

Returns

the relationship between the two structures and the count of chemical details that cannot be mapped.
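The relationship codes can be illustrated with a simplified comparison. The sketch rests on assumptions that this reference does not confirm: the two detail lists are aligned position by position and have equal length, the value '0' means "not specified", and the sign convention follows the determine_relationship description (1 when the first compound is more generic, -1 when the other one is). `compare_details_sketch` is a hypothetical name.

```python
def compare_details_sketch(one, the_other, unspecified="0"):
    """Return (relationship, unmapped_count) for two aligned detail lists."""
    one_missing = 0    # details only the other compound specifies
    other_missing = 0  # details only the first compound specifies
    conflicts = 0      # both specified, but with different values
    for mine, theirs in zip(one, the_other):
        if mine == theirs:
            continue
        if mine == unspecified:
            one_missing += 1
        elif theirs == unspecified:
            other_missing += 1
        else:
            conflicts += 1
    unmapped = one_missing + other_missing + conflicts
    if unmapped == 0:
        return 0, 0            # equivalent
    if conflicts == 0 and other_missing == 0:
        return 1, unmapped     # the first compound is more generic
    if conflicts == 0 and one_missing == 0:
        return -1, unmapped    # the other compound is more generic
    return 2, unmapped         # loose


equivalent = compare_details_sketch(["1", "0"], ["1", "0"])
generic = compare_details_sketch(["0", "6"], ["1", "6"])
loose = compare_details_sketch(["1", "0"], ["0", "6"])
```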

same_structure_relationship(the_other_compound) → tuple[source]

To determine the relationship of two compounds with the same structure.

Parameters

the_other_compound – the other Compound entity.

Returns

the relationship and the atom mappings between the two compounds.

generate_atom_mapping_by_atom_color(the_other_compound) → dict[source]

To generate the atom mappings between the two compounds.

Assume the two compounds have the same structure, so we can achieve atom mappings through atom colors.

Parameters

the_other_compound – the other Compound entity.

Returns

the atom mappings between the two compounds.

optimal_resonant_mapping(the_other_compound, mappings: list) → tuple[source]

To find the optimal atom mappings for compound pairs that are resonant type.

Parameters
  • the_other_compound – the other Compound entity.

  • mappings – the list of atom mappings between the two compounds detected by BASS.

Returns

the relationship and the atom mappings between the two compounds.

static determine_relationship(unmapped_count: dict) → int[source]

To determine the relationship between two compounds when there are multiple possible atom mappings.

We try to map as many details as possible.

0: equivalent; 1: self is more generic than the other compound; -1: the other compound is more generic than self; 2: either has chemical detail(s) that the other compound does not have.

Parameters

unmapped_count – the dictionary of relationship to the count of details that cannot be mapped.

Returns

the relationship between the two compounds.

circular_pair_relationship(other_compound, seconds: int = 50) → tuple[source]

To determine the relationship of two compounds with interchangeable circular and linear representations with time limit.

Parameters
  • other_compound – the other Compound entity.

  • seconds – the timeout limit.

Returns

the relationship and the atom mappings between the two compounds.

circular_pair_relationship_helper(other_compound) → tuple[source]

To determine the relationship of two compounds with interchangeable circular and linear representations. We first find the critical atoms that are involved in the formation of the ring. There can be several possibilities. Then we break the ring and restore the double bond in the aldehyde group that forms the ring. Finally, we check if the updated structure is the same as the other compound, determine the relationship between the two compounds, and generate the atom mappings.

Parameters

other_compound – the other Compound entity.

Returns

the relationship and the atom mappings between the two compounds.

break_cycle(critical_atoms: int) → None[source]

To break the cycle caused by the aldol reaction, which often occurs in sugars. Two steps are involved: 1) remove the neighbors; 2) restore the double bond in the aldehyde group.

Parameters

critical_atoms – the three critical atoms that are involved in the ring formation.

Returns

None.

restore_cycle(critical_atoms: list) → None[source]

To restore the ring caused by aldol reaction. The reverse process of break_cycle.

Parameters

critical_atoms – the three atoms that are involved in the aldol reaction.

Returns

None.

find_critical_atom_in_cycle() → list[source]

To find the C (atom_c) and O (atom_oo) in the aldehyde group, as well as the O (atom_o) in the hydroxy group, that are involved in the ring formation. We need to break the bond between atom_c and atom_o to obtain the linear form. Please check an example of the aldol reaction in sugars if the description is confusing.

Returns

the list of critical atoms.

update_atom_symbol(index: list, updated_symbol: str) → None[source]

To update the atom symbols. This is often used to remove/restore R group.

Parameters
  • index – the atom symbols of these indices to be updated.

  • updated_symbol – the updated symbol.

Returns

None.

validate_mapping_with_r(other_compound, one_rs: list, mapping: dict) → bool[source]

To validate the atom mappings with R groups. Here are two things we need to pay attention to:

  1. For the generic compound, the R group can be mapped to a branch or just H in the specific compound.

  2. For the specific compound, every unmatched branch needs to correspond to an R group in the generic compound.

In other words, the generic compound can have extra R groups that have no matched branch, but the specific compound cannot have unmatched branches that don’t correspond to any R groups.

For the specific validation:

1) We find all the linkages between an R group and a mapped atom in this compound, represented by the corresponding atom number in the other compound and the bond type. (We use the corresponding atom number in the other compound so that the R linkages in the two compounds can be compared in the next step.)

2) For every mapped atom in the other compound, we check whether it has neighbors that are not mapped. If so, the atom should be linked to an R group. We represent the linkage by the atom number and the bond type.

3) Based on the above validation criteria, we have to make sure that the R linkages in the other compound are a subset of the R linkages in this compound.

Parameters
  • other_compound – the other Compound entity.

  • one_rs – the R groups in the compound.

  • mapping – the atom mappings between the mapped parts of the two compounds.

Returns

bool whether the atom mappings are valid.
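The final subset check in step 3 can be sketched with multisets. The sketch assumes each R linkage is a (atom number, bond type) tuple expressed in the other compound's atom numbering, and it uses Counter so that an atom carrying two R groups is counted twice; whether the library compares sets or multisets is an assumption here, and `r_linkages_valid` is a hypothetical name.

```python
from collections import Counter


def r_linkages_valid(this_linkages, other_linkages):
    """True if every R linkage needed by the other compound exists in this one."""
    # Counter subtraction keeps only linkages the other compound has in excess.
    missing = Counter(other_linkages) - Counter(this_linkages)
    return not missing


# Generic compound: R groups attached at mapped atoms 3 and 7 by single bonds.
generic_linkages = [(3, "1"), (7, "1")]
# Specific compound: one unmatched branch at atom 3, attached by a single bond.
valid = r_linkages_valid(generic_linkages, [(3, "1")])
# An unmatched branch at atom 5 has no corresponding R group.
invalid = r_linkages_valid(generic_linkages, [(5, "1")])
```

The generic compound may keep an unused R linkage, here at atom 7, which matches the rule that extra R groups are allowed but unmatched branches are not.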

compare_chemical_details_with_mapping(other_compound, mapping: dict) → tuple[source]

To compare the chemical details of the mapped atoms of the two compounds. This part targets compound pairs of the resonance or r_group type. Only part of the chemical details needs to be checked: 1) atoms that are not involved in the resonance part or connected to R groups (both cases can be tested by the first layer atom coloring identifier); 2) bonds that are formed by the atoms described above.

Parameters
  • other_compound – the other Compound entity.

  • mapping – the mapped atoms between the two compounds.

Returns

the count of chemical details that cannot be mapped.

optimal_mapping_with_r(other_compound, one_rs: list, mappings: list) → tuple[source]

To find the optimal mappings of compound pairs belonging to the r_group type. In this case, multiple valid mappings can exist, and we need to find the one with the fewest unmapped chemical details. The unmapped chemical details can exist in either compound (generic or specific) and determine the relationship of the compound pair. The priority is generic-specific, then loose. The relationship cannot be equivalent.

Parameters
  • other_compound – the other Compound entity.

  • one_rs – the list of R groups in the compound.

  • mappings – the atom mappings of the mapped parts in the two compounds.

Returns

the relationship and atom mappings between the two compounds.

with_r_pair_relationship(other_compound, seconds: int = 50) → tuple[source]

To find, within a time limit, the relationship and the atom mappings between two compounds of r_group type.

Parameters
  • other_compound – the other Compound entity.

  • seconds – the timeout limit.

Returns

the relationship and the atom mappings between the two compounds.

with_r_pair_relationship_helper(other_compound) → tuple[source]

To find the relationship and the atom mappings between two compounds of r_group type. Several steps are involved:

1) Ignore the R groups in the two compounds and find if one compound (generic compound) is included in the other compound (specific compound).

2) If we can find the mappings, then we need to validate the mappings with the validate_mapping_with_r function.

3) Then we get the optimal atom mappings of the mapped parts.

4) We need to map the unmatched branches in the specific compound to the corresponding R group in the generic compound.

Parameters

other_compound – the other Compound entity.

Returns

the relationship and the atom mappings between the two compounds.

map_r_correspondents(one_rs: list, other_compound, mappings: dict) → dict[source]

To map the unmatched branches in the specific compound to the corresponding R group in the generic compound.

Parameters
  • one_rs – the list of R groups in the compound.

  • other_compound – the other Compound entity.

  • mappings – the atom mappings of the mapped parts in the two compounds.

Returns

the full atom mappings between the two compounds.

md_harmonize.reaction

This module provides the Reaction class entity.

class md_harmonize.reaction.Reaction(reaction_name: str, one_side: list, other_side: list, ecs: dict, atom_mappings: list, coefficients: dict)[source]

Reaction class describes the Reaction entity.

Reaction initializer.

Parameters
  • reaction_name – the reaction name.

  • one_side – the list of Compound entities in one side of the reaction.

  • other_side – the list of Compound entities in the other side of the reaction.

  • ecs – the dict of Enzyme Commission numbers (EC numbers) of the reaction.

  • atom_mappings – the list of atom mappings between two sides of the reaction.

  • coefficients – the dictionary of compound names and their corresponding coefficients in the reaction.

__init__(reaction_name: str, one_side: list, other_side: list, ecs: dict, atom_mappings: list, coefficients: dict) → None[source]

Reaction initializer.

Parameters
  • reaction_name – the reaction name.

  • one_side – the list of Compound entities in one side of the reaction.

  • other_side – the list of Compound entities in the other side of the reaction.

  • ecs – the dict of Enzyme Commission numbers (EC numbers) of the reaction.

  • atom_mappings – the list of atom mappings between two sides of the reaction.

  • coefficients – the dictionary of compound names and their corresponding coefficients in the reaction.

property name

To get the reaction name.

Returns

the reaction name.

md_harmonize.KEGG_database_scraper

This module provides functions to download KEGG data (including compound, reaction, kcf, and rclass) from the KEGG (REST) API.

Note that the KEGG URLs used by this module can change.

md_harmonize.KEGG_database_scraper.entry_list(target_url: str) → list[source]

To get the list of entry names to download.

Parameters

target_url – the url to fetch.

Returns

the list of entry names.

md_harmonize.KEGG_database_scraper.update_entity(entries: list, sub_directory: str, directory: str, suffix: str = '') → None[source]

To download the KEGG entity (compound, reaction, or rclass) and save it into a file.

Parameters
  • entries – the list of entry names to download.

  • sub_directory – the subdirectory to save the downloaded file.

  • directory – the main directory to save the downloaded file.

  • suffix – the suffix needed for download, like the mol for compound molfile and kcf for compound kcf file.

Returns

None.

md_harmonize.KEGG_database_scraper.curate_molfile(file_path: str) → None[source]

To curate the molfile representation.

Parameters

file_path – the path to the molfile.

Returns

None.

md_harmonize.KEGG_database_scraper.download(directory: str) → None[source]

To download all the KEGG required files.

Parameters

directory – the directory to store the data.

Returns

None.

md_harmonize.KEGG_parser

This module provides functions to parse KEGG data (including compound, reaction, kcf, and rclass).

md_harmonize.KEGG_parser.kegg_data_parser(data: list) → dict[source]

This is to parse a KEGG data file (reaction, rclass, or compound) into a dictionary.

eg:

ENTRY R00259 Reaction

NAME acetyl-CoA:L-glutamate N-acetyltransferase

DEFINITION Acetyl-CoA + L-Glutamate <=> CoA + N-Acetyl-L-glutamate

EQUATION C00024 + C00025 <=> C00010 + C00624

RCLASS RC00004 C00010_C00024

RC00064 C00025_C00624

ENZYME 2.3.1.1

PATHWAY rn00220 Arginine biosynthesis

rn01100 Metabolic pathways

rn01110 Biosynthesis of secondary metabolites

rn01210 2-Oxocarboxylic acid metabolism

rn01230 Biosynthesis of amino acids

MODULE M00028 Ornithine biosynthesis, glutamate => ornithine

M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine

ORTHOLOGY K00618 amino-acid N-acetyltransferase [EC:2.3.1.1]

K00619 amino-acid N-acetyltransferase [EC:2.3.1.1]

K00620 glutamate N-acetyltransferase / amino-acid N-acetyltransferase [EC:2.3.1.35 2.3.1.1]

K11067 N-acetylglutamate synthase [EC:2.3.1.1]

K14681 argininosuccinate lyase / amino-acid N-acetyltransferase [EC:4.3.2.1 2.3.1.1]

K14682 amino-acid N-acetyltransferase [EC:2.3.1.1]

K22476 N-acetylglutamate synthase [EC:2.3.1.1]

K22477 N-acetylglutamate synthase [EC:2.3.1.1]

K22478 bifunctional N-acetylglutamate synthase/kinase [EC:2.3.1.1 2.7.2.8]

DBLINKS RHEA: 24295

///

Parameters

data – the KEGG reaction description.

Returns

the dictionary of parsed KEGG data.
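
A minimal sketch of how such a flat file could be grouped by section keyword. It assumes the usual KEGG layout in which a section keyword starts at the first column and continuation lines are indented; the function name `kegg_flat_file_sketch` and the list-of-strings value shape are illustrative and not taken from the package.

```python
def kegg_flat_file_sketch(lines: list) -> dict:
    """Group the lines of one KEGG entry by section keyword (hypothetical helper)."""
    parsed = {}
    key = None
    for line in lines:
        if line.startswith("///"):
            # "///" terminates the entry.
            break
        if not line.strip():
            continue
        if not line[0].isspace():
            # A new section: the keyword is the first token on the line.
            key, _, value = line.partition(" ")
            parsed.setdefault(key, [])
        else:
            # An indented line continues the current section.
            value = line
        if key is not None:
            parsed[key].append(value.strip())
    return parsed


record = [
    "ENTRY       R00259                      Reaction",
    "RCLASS      RC00004  C00010_C00024",
    "            RC00064  C00025_C00624",
    "ENZYME      2.3.1.1",
    "///",
]
parsed = kegg_flat_file_sketch(record)
```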

md_harmonize.KEGG_parser.parse_equation(equation: str) → tuple[source]

This is to parse the KEGG reaction equation.

eg: C00029 + C00001 + 2 C00003 <=> C00167 + 2 C00004 + 2 C00080

Parameters

equation – the equation string.

Returns

the parsed KEGG reaction equation.
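
As an illustration, the equation above could be split into compounds and coefficients as sketched below. The function name `parse_equation_sketch` and the returned pair of dictionaries are assumptions; the actual tuple returned by parse_equation may be shaped differently. Coefficients are kept as strings in this sketch so that no numeric conversion is assumed.

```python
def parse_equation_sketch(equation: str) -> tuple:
    """Split a KEGG equation into two {compound: coefficient} dicts (hypothetical shape)."""
    sides = []
    for side in equation.split("<=>"):
        coefficients = {}
        for term in side.split(" + "):
            tokens = term.split()
            # A term is either "C00001" or "2 C00003"; a missing coefficient means 1.
            if len(tokens) == 2:
                coefficients[tokens[1]] = tokens[0]
            else:
                coefficients[tokens[0]] = "1"
        sides.append(coefficients)
    return sides[0], sides[1]


left, right = parse_equation_sketch(
    "C00029 + C00001 + 2 C00003 <=> C00167 + 2 C00004 + 2 C00080"
)
```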

md_harmonize.KEGG_parser.kegg_kcf_parser(kcf: list) → dict[source]

This is to parse a KEGG kcf file into a dictionary.

eg:

ENTRY C00013 Compound

ATOM 9

1 P1b P 22.2269 -20.0662

2 O2c O 23.5190 -20.0779

3 O1c O 21.0165 -20.0779

4 O1c O 22.2851 -21.4754

5 O1c O 22.2617 -18.4642

6 P1b P 24.8933 -20.0837

7 O1c O 24.9401 -21.4811

8 O1c O 26.1797 -20.0662

9 O1c O 24.9107 -18.4582

BOND 8

1 1 2 1

2 1 3 1

3 1 4 1

4 1 5 2

5 2 6 1

6 6 7 1

7 6 8 1

8 6 9 2

///

Parameters

kcf – the kcf text.

Returns

the dictionary of parsed kcf file.
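
The ATOM and BOND lines in the example can be read token by token. The sketch below assumes the column order shown above (atom number, KEGG atom type, element symbol, x, y for atoms; bond number, the two atom numbers, bond type for bonds). The helper names and dictionary keys are hypothetical.

```python
def parse_kcf_atom_line(line: str) -> dict:
    """Parse one line of the ATOM section, e.g. '1 P1b P 22.2269 -20.0662'."""
    number, kat, symbol, x, y = line.split()[:5]
    return {"atom_number": int(number), "kat": kat, "atom_symbol": symbol, "x": x, "y": y}


def parse_kcf_bond_line(line: str) -> dict:
    """Parse one line of the BOND section, e.g. '4 1 5 2'."""
    # Only the first four tokens are used; any trailing tokens are ignored.
    number, first, second, bond_type = line.split()[:4]
    return {
        "bond_number": int(number),
        "first_atom_number": int(first),
        "second_atom_number": int(second),
        "bond_type": bond_type,
    }


atom = parse_kcf_atom_line("1 P1b P 22.2269 -20.0662")
bond = parse_kcf_bond_line("4 1 5 2")
```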

class md_harmonize.KEGG_parser.reaction_center(i, kat, label, match, difference)

Create new instance of reaction_center(i, kat, label, match, difference)

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

static __new__(_cls, i, kat, label, match, difference)

Create new instance of reaction_center(i, kat, label, match, difference)

__repr__()

Return a nicely formatted representation string

difference

Alias for field number 4

i

Alias for field number 0

kat

Alias for field number 1

label

Alias for field number 2

match

Alias for field number 3
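
For orientation, an equivalent named tuple can be declared and used as sketched below. The field order follows the field numbers listed above; the field values are made up for illustration only.

```python
import collections

# Same field order as documented: i=0, kat=1, label=2, match=3, difference=4.
reaction_center = collections.namedtuple(
    "reaction_center", ["i", "kat", "label", "match", "difference"]
)

# Hypothetical values, chosen only to show field access.
center = reaction_center(i=0, kat="C8x", label="example", match=["N5y", "S2x"], difference=["*"])
```

Fields can then be read by name (`center.kat`) or by position (`center[1]`).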

class md_harmonize.KEGG_parser.RpairParser(rclass_name: str, rclass_definitions: list, one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound)[source]

This is to get one-to-one atom mappings between two compounds based on the rclass definition.

Several steps are involved in this process:

1. The rclass definition can have several pieces. Each piece describes a center atom (R) and its connected atoms. The connected atoms can stay the same (M) or change (D) between the two compound structures.

2. First we need to find the center atoms based on the rclass descriptions.

3. For each center atom, there can be multiple candidates. In other words, based on the RDM description, several atoms in the compound can meet the description. (Symmetric compounds are one simple case.)

4. Therefore, we need to generate all the combinations of the center atoms in a compound.

    eg: if there are three atom centers, each center has several candidates:

    center 1: [0, 1, 2]; center 2: [5, 6]; center 3: [10, 11]

    The combinations for the center atoms:

    [0, 5, 10], [0, 5, 11], [0, 6, 10], [0, 6, 11], [1, 5, 10], [1, 5, 11], [1, 6, 10], [1, 6, 11], [2, 5, 10], [2, 5, 11], [2, 6, 10], [2, 6, 11]

5. Next, we need to find the one-to-one atom mappings between the two compounds based on the mapped center atoms.

6. To do this, we first disassemble each compound into different components. This is needed because of the difference atoms in the two compounds, i.e. broken bonds.

7. Then we need to find the mappings between each disassembled component, and concatenate the mappings of all the components.

8. To find the one-to-one atom mappings, we use the BASS algorithm. We assume the mapped components have the same structure since we have already removed the different parts. However, here we only map the backbone of the structure (in other words, we simplify all bond types to 1) because bonds can change (double bond to single bond or triple bond to single bond).

9. To ensure the optimal mappings, we count the mapped atoms with changed local environment and choose the mapping with minimal changes.

RpairParser initializer.

Parameters
  • rclass_name – the rclass name.

  • rclass_definitions – a list of rclass definitions.

  • one_compound – one compound involved in the pair.

  • other_compound – the other compound involved in the pair.

__init__(rclass_name: str, rclass_definitions: list, one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound)[source]

RpairParser initializer.

Parameters
  • rclass_name – the rclass name.

  • rclass_definitions – a list of rclass definitions.

  • one_compound – one compound involved in the pair.

  • other_compound – the other compound involved in the pair.

map_atom_by_colors() → dict[source]

Roughly map the atoms between the two compounds by the atom color.

Returns

the dict of mapped atom indices between the two compounds.

map_whole_compound() → dict[source]

Map two compounds if the two compounds can be roughly mapped by the atom color.

Returns

the dict of mapped atoms in the two compounds.

static generate_kat_neighbors(this_compound: md_harmonize.compound.Compound) → list[source]

Generate the atom neighbors represented by KEGG atom type for each atom in the compound. This is used to find the center atom. We use KEGG atom type since the descriptions of atoms in the rclass definitions use KEGG atom type.

Parameters

this_compound – the compound entity.

Returns

the list of atoms with their neighbors.

static find_target_atom(atoms: list, target: tuple) → list[source]

Find the target atoms from a list of candidate atoms.

Parameters
  • atoms – a list of atoms to search from.

  • target – the target atom to be searched.

Returns

the list of atom numbers that match the target atom.

static create_reaction_centers(i: int, kat: str, difference: list, the_other_difference: list, match: list, the_other_match: list) → collections.namedtuple[source]

Create the center atom based on its connected atoms and its counterpart atom in the other compound.

Parameters
  • i – the ith rclass definition.

  • kat – the KEGG atom type of the center atom.

  • difference – the list of KEGG atom type of different connected atoms.

  • the_other_difference – the list of KEGG atom type of different connected atoms in the other compound.

  • match – the list of KEGG atom type of the matched connected atoms.

  • the_other_match – the list of KEGG atom type of the matched connected atoms in the other compound.

Returns

the constructed reaction center.

find_center_atoms() → tuple[source]

Example of rclass definition:

C8x-C8y:*-C1c:N5y+S2x-N5y+S2x

The RDM pattern is defined as KEGG atom type changes at the reaction center (R), the difference region (D), and the matched region (M) for each reactant pair. It characterizes chemical structure transformation patterns associated with enzymatic reactions.

Returns

the list of reaction centers and the corresponding candidate atoms.
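
The example definition can be taken apart into its R, D and M regions as sketched below. The sketch assumes the three regions are separated by ":", the two compounds by "-", and multiple atoms within a region by "+", as in the example above. The function name `split_rdm` and the returned dictionary are hypothetical.

```python
def split_rdm(definition: str) -> dict:
    """Split one RDM definition into (left, right) KEGG atom type lists per region."""

    def both_sides(part: str) -> tuple:
        # "N5y+S2x-N5y+S2x" -> (["N5y", "S2x"], ["N5y", "S2x"])
        left, right = part.split("-")
        return left.split("+"), right.split("+")

    center, difference, match = definition.split(":")
    return {
        "center": both_sides(center),
        "difference": both_sides(difference),
        "match": both_sides(match),
    }


rdm = split_rdm("C8x-C8y:*-C1c:N5y+S2x-N5y+S2x")
```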

static get_center_list(center_atom_index: list) → list[source]

Generate all the combinations of reaction centers.

Parameters

center_atom_index – the list of candidate atom indices for each reaction center. eg: three reaction centers: [[0, 1, 2], [5, 6], [10, 11]].

Returns

the list of combined reaction centers. eg: [[0, 5, 10], [0, 5, 11], [0, 6, 10], [0, 6, 11], [1, 5, 10], [1, 5, 11], [1, 6, 10], [1, 6, 11], [2, 5, 10], [2, 5, 11], [2, 6, 10], [2, 6, 11]]
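
The combinations in this example are a Cartesian product of the candidate lists. A minimal sketch using `itertools.product` follows; the function name `center_combinations` is hypothetical, and get_center_list may be implemented differently.

```python
import itertools


def center_combinations(center_atom_index: list) -> list:
    """Pick one candidate per reaction center, in every possible way."""
    return [list(combination) for combination in itertools.product(*center_atom_index)]


combinations = center_combinations([[0, 1, 2], [5, 6], [10, 11]])
```

The number of combinations is the product of the candidate counts (here 3 × 2 × 2 = 12), so it grows quickly for symmetric compounds with many centers.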

static remove_different_bonds(this_compound: md_harmonize.compound.Compound, center_atom_numbers: list, reaction_centers: list) → list[source]

Remove the bonds connecting to different atoms. For each reaction center, multiple atoms can be the different atoms. We need to get all the combinations.

Parameters
  • this_compound – the Compound entity.

  • center_atom_numbers – the list of atom numbers for center atom in the compound.

  • reaction_centers – the list of reaction center descriptions for the compound.

Returns

the list of bonds (represented by the atom numbers in the bond) that needs to be removed based on the RDM descriptions.

generate_atom_mappings() → list[source]

Generate the one-to-one atom mappings of the compound pair.

Returns

the list of atom mappings.

static detect_components(this_compound: md_harmonize.compound.Compound, removed_bonds: list, center_atom_numbers: list) → list[source]

Detect all the components in the compound after removing some bonds. The basic idea is the breadth-first search algorithm.

Parameters
  • this_compound – the Compound entity.

  • removed_bonds – the list of removed bonds (represented by the atom numbers in the bond) in the compound.

  • center_atom_numbers – the list of atom numbers of the center atoms in the compound.

Returns

the list of components of the compound represented by a list of atom numbers.
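
The breadth-first search idea can be sketched on plain atom numbers and bonds. This sketch ignores the center_atom_numbers argument and the Compound class entirely; the function name `connected_components` and the tuple representation of bonds are assumptions for illustration.

```python
from collections import deque


def connected_components(atom_numbers: list, bonds: list, removed_bonds: list) -> list:
    """Return the components that remain after dropping removed_bonds."""
    removed = {frozenset(bond) for bond in removed_bonds}
    neighbors = {atom: set() for atom in atom_numbers}
    for first, second in bonds:
        # Bonds are undirected, so (1, 2) and (2, 1) refer to the same bond.
        if frozenset((first, second)) not in removed:
            neighbors[first].add(second)
            neighbors[second].add(first)

    seen = set()
    components = []
    for start in atom_numbers:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            atom = queue.popleft()
            component.append(atom)
            for neighbor in sorted(neighbors[atom]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# A chain 0-1-2-3-4 with the bond between atoms 1 and 2 removed.
components = connected_components([0, 1, 2, 3, 4], [(0, 1), (1, 2), (2, 3), (3, 4)], [(2, 1)])
```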

static pair_components(left_components: list, right_components: list) → list[source]

The two compounds are divided into separate components due to the difference atoms. We need to pair each component in one compound to its counterpart component in the other compound. Here we roughly pair the components based on the number of atoms in the component. Therefore, every component in one compound can be paired with several components in the other compound.

Parameters
  • left_components – the components in one compound.

  • right_components – the components in the other compound.

Returns

the list of paired components.
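
A rough pairing by component size, as described above, might look like the following sketch. Components are represented here as lists of atom numbers, and the function name `pair_by_size` and the list-of-tuples return shape are assumptions.

```python
def pair_by_size(left_components: list, right_components: list) -> list:
    """Pair every left component with each right component of the same size."""
    pairs = []
    for left in left_components:
        for right in right_components:
            if len(left) == len(right):
                pairs.append((left, right))
    return pairs


pairs = pair_by_size([[0, 1], [2, 3, 4]], [[5, 6], [7, 8], [9, 10, 11]])
```

Because size is the only criterion, the component `[0, 1]` is paired with both `[5, 6]` and `[7, 8]`; choosing between such candidates is left to the later mapping steps.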

static construct_component(this_compound: md_harmonize.compound.Compound, atom_numbers: list, removed_bonds: list) → md_harmonize.compound.Compound[source]

Construct a Compound entity for the component based on the atom index and removed bonds, facilitating the following atom mappings.

Parameters
  • this_compound – the Compound entity.

  • atom_numbers – the list of atom numbers in the component.

  • removed_bonds – the list of removed bonds (represented by the atom numbers in the bond) in the compound.

Returns

the constructed component compound.

static preliminary_atom_mappings_check(left_component: md_harmonize.compound.Compound, right_component: md_harmonize.compound.Compound) → bool[source]

Roughly evaluate whether the atoms in the two components can be mapped. We check whether every atom color in the left component has its counterpart in the right component. Here, we only consider the backbone of the structure.

Parameters
  • left_component – the component in one compound.

  • right_component – the component in the other compound.

Returns

bool whether the atoms in the two components can be mapped.

map_components(left_removed_bonds: list, right_removed_bonds: list, left_centers: list, right_centers: list) → dict[source]

Find the optimal mapping for every component in the compound pair.

Parameters
  • left_removed_bonds – the list of removed bonds in one compound.

  • right_removed_bonds – the list of removed bonds in the other compound.

  • left_centers – the list of atom numbers of the center atoms in one compound.

  • right_centers – the list of atom numbers of the center atoms in the other compound.

Returns

the atom mappings for the compound pair based on the removed bonds and center atoms.

combine_atom_mappings(atom_mappings: list) → dict[source]

Combine the atom mappings of all the components. As mentioned in the pair_components function, every component can have several mappings. Here, we choose the optimal mapping with the lowest count of changed local atom identifiers, and make sure that each atom is mapped only once.

Parameters

atom_mappings – the list of atom mappings for all the components.

Returns

the atom mappings for the compound pair.

static validate_component_atom_mappings(left_centers: list, right_centers: list, component_atom_mappings: dict) → bool[source]

Check whether the mapped atoms correspond to the mapped reaction center atoms.

Parameters
  • left_centers – the list of center atom indices in the left compound.

  • right_centers – the list of center atom indices in the right compound.

  • component_atom_mappings – the one to one atom mappings of one component.

Returns

bool whether the mappings are valid.

count_changed_atom_identifiers(one_to_one_mappings: dict) → int[source]

Count the mapped atoms with changed local atom identifier. The different atoms (D in RCLASS definitions) can cause change of local environment, which can change the atom identifier.

Parameters

one_to_one_mappings – the dictionary of atom mappings between the two compounds.

Returns

the total number of mapped atoms with different local identifiers.

md_harmonize.KEGG_parser.create_compound_kcf(kcf_file: str) → Optional[md_harmonize.compound.Compound][source]

Construct compound entity based on the KEGG kcf file.

Parameters

kcf_file – the path to the kcf file.

Returns

the constructed compound entity.

md_harmonize.KEGG_parser.create_reactions(reaction_directory: str, atom_mappings: dict) → list[source]

Create KEGG Reaction entities.

Parameters
  • reaction_directory – the directory that stores all the reaction files.

  • atom_mappings – the compound pair name and its atom mappings.

Returns

the constructed Reaction entities.

md_harmonize.KEGG_parser.compound_pair_mappings(pair_component: tuple) → tuple[source]

Get the atom mappings between two compounds based on the rclass definitions.

Parameters

pair_component – a tuple containing the rclass_name, rclass_definitions, one_compound and the_other_compound.

Returns

the compound pair name and its atom mappings.

md_harmonize.KEGG_parser.create_atom_mappings(rclass_directory: str, compounds: dict, seconds: int = 1200) → dict[source]

Generate the atom mappings between compounds based on RCLASS definitions.

Parameters
  • rclass_directory – the directory that stores the rclass files.

  • compounds – a dictionary of Compound entities.

  • seconds – the timeout limit.

Returns

the atom mappings of compound pairs.

md_harmonize.MetaCyc_parser

This module provides functions to parse MetaCyc text data.

Note: All MetaCyc reactions atom_mappings are stored in a single text file.

md_harmonize.MetaCyc_parser.reaction_side_parser(reaction_side: str) → dict[source]

This is to parse FROM-SIDE or TO-SIDE in the reaction.

eg: FROM-SIDE - (CPD-9147 0 8) (OXYGEN-MOLECULE 9 10)

The information includes the compound name and the start and end atom indices in this compound used for atom mappings. The order of the atoms is the same as in the compound molfile.

Parameters

reaction_side – the text description of reaction side.

Returns

the dictionary of compounds and the corresponding start and end atom index in the atom mappings.
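
The parenthesized groups in a side description can be extracted with a regular expression, as in the sketch below. The function name `parse_side` and the tuple value shape are assumptions. The sketch also overwrites the entry if a compound name occurs twice on one side, which the real parser may handle differently.

```python
import re


def parse_side(reaction_side: str) -> dict:
    """Map each compound name to its (start, end) atom indices."""
    compounds = {}
    # Each group looks like "(NAME start end)"; NAME contains no spaces or parentheses.
    for name, start, end in re.findall(r"\(([^\s()]+) (\d+) (\d+)\)", reaction_side):
        compounds[name] = (int(start), int(end))
    return compounds


side = parse_side("FROM-SIDE - (CPD-9147 0 8) (OXYGEN-MOLECULE 9 10)")
```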

md_harmonize.MetaCyc_parser.generate_one_to_one_mappings(from_side: dict, to_side: dict, indices: str) → list[source]

To generate the one-to-one atom mappings between the two sides of a metabolic reaction.

Parameters
  • from_side – the dictionary of compounds with their corresponding start and end atom indices in the from_side.

  • to_side – the dictionary of compounds with their corresponding start and end atom indices in the to_side.

  • indices – the string representation of mapped atoms.

Returns

the list of mapped atoms between the two sides (from_index, to_index).

md_harmonize.MetaCyc_parser.atom_mappings_parser(atom_mapping_text: list) → dict[source]

This is to parse the MetaCyc reaction with atom mappings.

eg:

REACTION - RXN-11981

NTH-ATOM-MAPPING - 1

MAPPING-TYPE - NO-HYDROGEN-ENCODING

FROM-SIDE - (CPD-12950 0 23) (WATER 24 24)

TO-SIDE - (CPD-12949 0 24)

INDICES - 0 1 2 3 5 4 7 6 9 10 11 13 12 14 15 16 17 8 18 19 21 20 22 24 23

note: the INDICES are atom mappings between two sides of the reaction. TO-SIDE[i] is mapped to FROM-SIDE[idx] for i, idx in enumerate(INDICES). Pay attention to the direction!

Parameters

atom_mapping_text – the text descriptions of reactions with atom mappings.

Returns

the dictionary of reactions with atom mappings.
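
The direction of INDICES described in the note can be made concrete with a short sketch. It only pairs raw side-wide indices and does not resolve them to compounds; the function name `index_pairs` is hypothetical.

```python
def index_pairs(indices: str) -> list:
    """Return (from_index, to_index) pairs: TO-SIDE[i] maps to FROM-SIDE[INDICES[i]]."""
    values = [int(token) for token in indices.split()]
    # The position in INDICES is the TO-SIDE index; the value is the FROM-SIDE index.
    return [(from_index, to_index) for to_index, from_index in enumerate(values)]


pairs = index_pairs("1 0 2")
```

For the input `"1 0 2"`, atom 0 on the TO-SIDE is mapped to atom 1 on the FROM-SIDE, atom 1 to atom 0, and atom 2 to atom 2.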

md_harmonize.MetaCyc_parser.reaction_parser(reaction_text: list) → dict[source]

This is used to parse MetaCyc reaction.

eg:

UNIQUE-ID - RXN-13583

TYPES - Redox-Half-Reactions

ATOM-MAPPINGS - (:NO-HYDROGEN-ENCODING (1 0 2) (((WATER 0 0) (HYDROXYLAMINE 1 2)) ((NITRITE 0 2))))

CREDITS - SRI

CREDITS - caspi

IN-PATHWAY - HAONITRO-RXN

LEFT - NITRITE

^COMPARTMENT - CCO-IN

LEFT - PROTON

^COEFFICIENT - 5

^COMPARTMENT - CCO-IN

LEFT - E-

^COEFFICIENT - 4

ORPHAN? - :NO

PHYSIOLOGICALLY-RELEVANT? - T

REACTION-BALANCE-STATUS - :BALANCED

REACTION-DIRECTION - LEFT-TO-RIGHT

RIGHT - HYDROXYLAMINE

^COMPARTMENT - CCO-IN

RIGHT - WATER

^COMPARTMENT - CCO-IN

STD-REDUCTION-POTENTIAL - 0.1

//

Parameters

reaction_text – the text descriptions of MetaCyc reactions.

Returns

the dict of parsed MetaCyc reactions.

md_harmonize.MetaCyc_parser.create_reactions(reaction_file: str, atom_mapping_file: str) → list[source]

To create MetaCyc reaction entities.

Parameters
  • reaction_file – the path to the reaction file.

  • atom_mapping_file – the path to the atom mapping file.

Returns

the list of constructed Reaction entities.

md_harmonize.aromatics

This module provides the AromaticManager class entity.

class md_harmonize.aromatics.AromaticManager(aromatic_substructures: list = None)[source]

Two major functions are implemented in AromaticManager.

  1. Extract aromatic substructures based on labelled aromatic atoms (mainly C and N) or Indigo detected aromatic bonds.

    The first case only applies to KEGG compounds, and the second case applies to compounds from any databases.

  2. Detect the aromatic substructures in any given compound, and update the bond type of the detected aromatic bonds.

AromaticManager initializer.

Parameters

aromatic_substructures – a list of aromatic substructures.

__init__(aromatic_substructures: list = None) → None[source]

AromaticManager initializer.

Parameters

aromatic_substructures – a list of aromatic substructures.

encode() → list[source]

To encode the aromatic substructures in the aromatic manager. (The AromaticManager cannot be serialized directly with jsonpickle because the cythonized entities cannot be pickled.)

Returns

the list of aromatic substructures.

static decode(aromatic_structures: list)[source]

Construct the AromaticManager based on the aromatic substructures.

Parameters

aromatic_structures – the list of aromatic substructures.

Returns

the constructed AromaticManager.

add_aromatic_substructures(substructures: list) → None[source]

Add newly detected aromatic substructures to the manager, making sure that the aromatic substructures contain no duplicates.

Parameters

substructures – a list of aromatic substructures.

Returns

None.

kegg_aromatize(kcf_cpd: md_harmonize.compound.Compound) → None[source]

Extract aromatic substructures based on KEGG atom type in KEGG compound parsed from KCF file, and add the newly detected aromatic substructures to the AromaticManager.

Parameters

kcf_cpd – the KEGG compound entity derived from KCF file.

Returns

None.

indigo_aromatize(molfile: str) → None[source]

Extract aromatic substructures via Indigo, and add the newly detected aromatic substructures to the AromaticManager.

Parameters

molfile – the path to the molfile.

Returns

None.

indigo_aromatic_bonds(molfile: str) → set[source]

Detect the aromatic bonds in the compound via Indigo method.

Parameters

molfile – the path to the molfile.

Returns

the set of aromatic bonds represented by first_atom_number and second_atom_number in the bond.

static fuse_cycles(cycles: list) → list[source]

To fuse the cycles with shared atoms.

Parameters

cycles – the list of cycles represented by atom numbers.

Returns

the list of cleaned cycles.
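
One way to fuse cycles is to merge any two cycles that share at least one atom until no overlap remains. The sketch below works under that assumption; the actual criterion used by fuse_cycles (for example, requiring a shared bond) may be stricter, and the function name `fuse_cycles_sketch` is hypothetical.

```python
def fuse_cycles_sketch(cycles: list) -> list:
    """Merge cycles (lists of atom numbers) that overlap in at least one atom."""
    fused = []
    for cycle in cycles:
        current = set(cycle)
        remaining = []
        for group in fused:
            if group & current:
                # Overlapping groups are absorbed into the current one.
                current |= group
            else:
                remaining.append(group)
        remaining.append(current)
        fused = remaining
    return sorted(sorted(group) for group in fused)


fused = fuse_cycles_sketch([[1, 2, 3], [3, 4, 5], [7, 8, 9], [5, 6]])
```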

detect_aromatic_substructures_timeout(cpd: md_harmonize.compound.Compound) → None[source]

Detect the aromatic substructures in the compound and stop the search on timeout.

Parameters

cpd – the Compound entity.

Returns

None.

detect_aromatic_substructures(cpd: md_harmonize.compound.Compound) → None[source]

Detect all the aromatic substructures in the cpd, and update the bond type of aromatic bonds.

Parameters

cpd – the Compound entity.

Returns

None.

static construct_aromatic_entity(cpd: md_harmonize.compound.Compound, aromatic_cycles: list) → list[source]

Construct the aromatic substructure entity based on the aromatic atoms. Here, we also include outside atoms that are connected to aromatic rings with double bonds.

Parameters
  • cpd – the Compound entity.

  • aromatic_cycles – the list of aromatic cycles represented by atom numbers in the compound.

Returns

the list of constructed aromatic substructures.

extract_aromatic_substructures(cpd: md_harmonize.compound.Compound) → list[source]

Detect the aromatic substructures in a compound based on the aromatic atoms. This only applies to KEGG kcf file.

Parameters

cpd – the Compound entity.

Returns

the list of aromatic cycles represented by atom numbers.

md_harmonize.harmonization

This module provides the HarmonizedEdge class, the HarmonizedCompoundEdge class, and the HarmonizedReactionEdge class.

class md_harmonize.harmonization.HarmonizedEdge(one_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], other_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], relationship: int, edge_type: Union[str, int], mappings: dict)[source]

The HarmonizedEdge representing compound or reaction pairs.

HarmonizedEdge initializer.

Parameters
  • one_side – one side of the edge. This can be compound or reaction.

  • other_side – the other side of the edge. This can be compound or reaction.

  • relationship – equivalent, generic-specific, or loose.

  • edge_type – for compound edge, this represents resonance, linear-circular, r group, same structure; for reaction edge, this represents 3 level match or 4 level match.

  • mappings – for compound edge, the mappings refer to mapped atoms between compounds; for reaction edge, the mappings refer to mapped compounds between reactions.

__init__(one_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], other_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], relationship: int, edge_type: Union[str, int], mappings: dict) → None[source]

HarmonizedEdge initializer.

Parameters
  • one_side – one side of the edge. This can be compound or reaction.

  • other_side – the other side of the edge. This can be compound or reaction.

  • relationship – equivalent, generic-specific, or loose.

  • edge_type – for compound edge, this represents resonance, linear-circular, r group, same structure; for reaction edge, this represents 3 level match or 4 level match.

  • mappings – for compound edge, the mappings refer to mapped atoms between compounds; for reaction edge, the mappings refer to mapped compounds between reactions.

property reversed_relationship

Get the relationship between the other side and one side.

Returns

the reversed relationship.

pair_relationship(name: str) → int[source]

When we map compounds in the reaction, we can access the compound edge from either side.

Parameters

name – the name of the side being searched.

Returns

the relationship of the searched pair.

class md_harmonize.harmonization.HarmonizedCompoundEdge(one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound, relationship: int, edge_type: str, atom_mappings: dict)[source]

The HarmonizedCompoundEdge representing compound pairs.

HarmonizedCompoundEdge initializer.

Parameters
  • one_compound – one Compound entity in the compound pair.

  • other_compound – the other Compound entity in the compound pair.

  • relationship – the relationship (equivalent, generic-specific, or loose) between the two compounds.

  • edge_type – the edge type can be resonance, linear-circular, r group, or same structure.

  • atom_mappings – the atom mappings between the two compounds.

__init__(one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound, relationship: int, edge_type: str, atom_mappings: dict) → None[source]

HarmonizedCompoundEdge initializer.

Parameters
  • one_compound – one Compound entity in the compound pair.

  • other_compound – the other Compound entity in the compound pair.

  • relationship – the relationship (equivalent, generic-specific, or loose) between the two compounds.

  • edge_type – the edge type can be resonance, linear-circular, r group, or same structure.

  • atom_mappings – the atom mappings between the two compounds.

property reversed_mappings

Get the atom mappings from the compound on the other side to the compound on one side.

Returns

the atom mappings from the other side compound to the one side compound.

pair_atom_mappings(name: str) → dict[source]

Get the atom mappings of the harmonized compound edge, where one side equals the parameter name.

Parameters

name – the compound name.

Returns

the atom mappings.

class md_harmonize.harmonization.HarmonizedReactionEdge(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction, relationship: int, edge_type: int, compound_mappings: dict)[source]

The HarmonizedReactionEdge to represent reaction pairs.

HarmonizedReactionEdge initializer.

Parameters
  • one_reaction – one Reaction entity in the reaction pair.

  • other_reaction – the other Reaction entity in the reaction pair.

  • relationship – the relationship (equivalent, generic-specific, or loose) between the two reactions.

  • edge_type – the reactions can be 3-level EC or 4-level EC paired.

  • compound_mappings – the dictionary of paired compounds in the reaction pair.

__init__(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction, relationship: int, edge_type: int, compound_mappings: dict) → None[source]

HarmonizedReactionEdge initializer.

Parameters
  • one_reaction – one Reaction entity in the reaction pair.

  • other_reaction – the other Reaction entity in the reaction pair.

  • relationship – the relationship (equivalent, generic-specific, or loose) between the two reactions.

  • edge_type – the reactions can be 3-level EC or 4-level EC paired.

  • compound_mappings – the dictionary of paired compounds in the reaction pair.

class md_harmonize.harmonization.HarmonizationManager[source]

The HarmonizationManager is responsible for adding, removing, or searching harmonized edges.

HarmonizationManager initializer.

__init__() → None[source]

HarmonizationManager initializer.

save_manager() → list[source]

Save the names of all the harmonized edges.

Returns

the list of names of harmonized edges.

static create_key(name_1: str, name_2: str) → str[source]

Create the edge key. Each edge is represented by a unique key in the harmonized_edges dictionary.

Parameters
  • name_1 – the name of one side of the edge.

  • name_2 – the name of the other side of the edge.

Returns

the key of the edge.

add_edge(edge: Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge]) → bool[source]

Add this edge to the harmonized edges.

Parameters

edge – the HarmonizedEdge entity.

Returns

bool whether the edge is added successfully.

remove_edge(edge: Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge]) → bool[source]

Remove this edge from the harmonized edges.

Parameters

edge – the HarmonizedEdge entity.

Returns

bool whether the edge is removed successfully.

search(name_1: str, name_2: str) → Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge, None][source]

Search the edge based on the names of the two sides.

Parameters
  • name_1 – the name of one side of the edge.

  • name_2 – the name of the other side of the edge.

Returns

edge if the edge exists or None.
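The add, remove, and search operations described above can be sketched with a small dictionary-backed registry. This is a toy stand-in, not the package's implementation: the class name EdgeRegistry, the "@" separator, the order-independent key, and the choice to pass the two names explicitly (rather than reading them from an edge object) are all assumptions.

```python
from typing import Optional


class EdgeRegistry:
    """Toy registry that stores each edge under a key built from two names."""

    def __init__(self) -> None:
        self.harmonized_edges = {}

    @staticmethod
    def create_key(name_1: str, name_2: str) -> str:
        # Assumption: the key does not depend on the order of the two names.
        return "@".join(sorted((name_1, name_2)))

    def add_edge(self, name_1: str, name_2: str, edge: object) -> bool:
        key = self.create_key(name_1, name_2)
        if key in self.harmonized_edges:
            return False  # already registered, nothing is added
        self.harmonized_edges[key] = edge
        return True

    def remove_edge(self, name_1: str, name_2: str) -> bool:
        key = self.create_key(name_1, name_2)
        if key not in self.harmonized_edges:
            return False  # nothing to remove
        del self.harmonized_edges[key]
        return True

    def search(self, name_1: str, name_2: str) -> Optional[object]:
        # Returns the stored edge, or None when the pair is unknown.
        return self.harmonized_edges.get(self.create_key(name_1, name_2))
```

Keeping key construction in one static method means that adding, removing, and searching all agree on how a pair of names is identified.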

class md_harmonize.harmonization.CompoundHarmonizationManager[source]

The CompoundHarmonizationManager is responsible for adding, removing, or searching HarmonizedCompoundEdge entities.

CompoundHarmonizationManager initializer.

__init__() → None[source]

CompoundHarmonizationManager initializer.

static find_compound(compound_dict: list, compound_name: str) → Optional[md_harmonize.compound.Compound][source]

Find the Compound with the given name in the list of compound dictionaries.

Parameters
  • compound_dict – a list of compound dictionaries.

  • compound_name – the target compound name.

Returns

the Compound, or None if it is not found.

static create_manager(compound_dict: list, compound_pairs: list)[source]

Create the CompoundHarmonizationManager based on the compound pairs.

Parameters
  • compound_dict – the list of compound dictionaries.

  • compound_pairs – the list of compound pairs.

Returns

the CompoundHarmonizationManager.

add_edge(edge: md_harmonize.harmonization.HarmonizedCompoundEdge) → bool[source]

Add a newly detected edge to the manager, and update the occurrences of each compound in the harmonized edges. The occurrences are used to calculate the Jaccard index.

Parameters

edge – the HarmonizedCompoundEdge entity.

Returns

bool whether the edge is added successfully.

remove_edge(edge: md_harmonize.harmonization.HarmonizedCompoundEdge) → bool[source]

Remove the edge from the manager, and update the occurrences of each compound in the harmonized edges.

Parameters

edge – the HarmonizedCompoundEdge entity.

Returns

bool whether the edge is removed successfully.

has_visited(name_1: str, name_2: str) → bool[source]

Check if the compound pair has been visited.

Parameters
  • name_1 – the name of one side of the edge.

  • name_2 – the name of the other side of the edge.

Returns

bool whether the pair has been visited.

add_invalid(name_1: str, name_2: str) → None[source]

Add the names of an invalid compound pair to the visited pairs.

Parameters
  • name_1 – the name of one side of the edge.

  • name_2 – the name of the other side of the edge.

Returns

None.

get_edge_list() → list[source]

Get the names of all the harmonized edges.

Returns

the list of names of harmonized edges.

class md_harmonize.harmonization.ReactionHarmonizationManager(compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager)[source]

The ReactionHarmonizationManager is responsible for adding, removing, or searching HarmonizedReactionEdge entities.

ReactionHarmonizationManager initializer.

Parameters

compound_harmonization_manager – the CompoundHarmonizationManager entity for compound pairs management.

__init__(compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager) → None[source]

ReactionHarmonizationManager initializer.

Parameters

compound_harmonization_manager – the CompoundHarmonizationManager entity for compound pairs management.

static compare_ecs(one_ecs: dict, other_ecs: dict) → int[source]

Compare two dictionaries of EC numbers.

Parameters
  • one_ecs – the dict of EC numbers of one reaction.

  • other_ecs – the dict of EC numbers of the other reaction.

Returns

the EC number level at which the two reactions can be matched.
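One way to picture this comparison is the sketch below. It assumes, purely for illustration, that each dictionary is keyed by EC level (4 and 3) and holds EC number strings truncated to that level, and that 0 means no match; neither the layout nor the return codes are confirmed by this reference. The name compare_ec_levels is hypothetical.

```python
def compare_ec_levels(one_ecs: dict, other_ecs: dict) -> int:
    """Return the deepest EC level shared by both reactions, or 0 if none.

    Assumption: each dict maps a level (4 or 3) to EC numbers at that level.
    """
    for level in (4, 3):  # prefer the more specific 4-level match
        one_set = set(one_ecs.get(level, ()))
        other_set = set(other_ecs.get(level, ()))
        if one_set & other_set:
            return level
    return 0


one = {4: ["1.1.1.1"], 3: ["1.1.1"]}
other = {4: ["1.1.1.2"], 3: ["1.1.1"]}
level = compare_ec_levels(one, other)  # the two share only the 3-level EC
```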

static determine_relationship(relationships: list) → int[source]

Determine the relationship of the reaction pair based on the relationship of paired compounds.

Parameters

relationships – the list of relationships of the compound pairs in the two reactions.

Returns

the relationship between the two reactions.
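The sketch below shows one plausible way to combine compound-level relationships into a reaction-level relationship. The integer codes, the idea that generic-specific has a direction, and the combination rules are all assumptions chosen for illustration; the package's actual codes and rules may differ.

```python
# Hypothetical integer codes; the real values are not given in this reference.
EQUIVALENT, GENERIC_SPECIFIC, SPECIFIC_GENERIC, LOOSE = 0, 1, -1, 2


def combine_relationships(relationships: list) -> int:
    """Reduce a list of compound-pair relationships to one reaction relationship."""
    kinds = set(relationships)
    if LOOSE in kinds:
        return LOOSE  # any loose pair makes the whole reaction pair loose
    if GENERIC_SPECIFIC in kinds and SPECIFIC_GENERIC in kinds:
        return LOOSE  # conflicting directions cannot be ordered consistently
    if GENERIC_SPECIFIC in kinds:
        return GENERIC_SPECIFIC
    if SPECIFIC_GENERIC in kinds:
        return SPECIFIC_GENERIC
    return EQUIVALENT  # every pair is equivalent (or the list is empty)
```

Under these assumed rules, a single loose or conflicting compound pair is enough to downgrade the whole reaction pair.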

harmonize_reaction(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction) → None[source]

Test if two reactions can be harmonized.

Parameters
  • one_reaction – one Reaction that is involved in the reaction pair.

  • other_reaction – the other Reaction that is involved in the reaction pair.

Returns

None.

compound_mappings(one_compounds: list, other_compounds: list) → dict[source]

Get the mapped compounds in the two compound lists.

Parameters
  • one_compounds – one list of Compound entities.

  • other_compounds – the other list of Compound entities.

Returns

the dictionary of paired compounds with their relationships. The relationships are used to determine the relationship of the reaction pair.

unmapped_compounds(one_compounds: list, other_compounds: list, mappings: dict) → tuple[source]

Get the compounds that cannot be mapped. This can lead to new compound pairs.

Parameters
  • one_compounds – one list of Compound entities.

  • other_compounds – the other list of Compound entities.

  • mappings – the mapped compounds between the two compound lists.

Returns

two lists of compounds that cannot be mapped.
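A minimal sketch of separating mapped from unmapped compounds follows. It works on compound names rather than Compound entities and assumes the mappings dictionary maps each one-side name to a collection of other-side names; both simplifications, and the name split_unmapped, are assumptions for illustration.

```python
def split_unmapped(one_names: list, other_names: list, mappings: dict) -> tuple:
    """Return the names on each side that take part in no mapping.

    Assumption: mappings maps a one-side name to a collection (list or dict)
    of other-side names.
    """
    mapped_one = set(mappings)
    mapped_other = {name for partners in mappings.values() for name in partners}
    one_left = [name for name in one_names if name not in mapped_one]
    other_left = [name for name in other_names if name not in mapped_other]
    return one_left, other_left


left = split_unmapped(["A", "B", "C"], ["x", "y"], {"A": ["x"]})
```

The leftover names on the two sides are the candidates for new compound pairs.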

match_unmapped_compounds(one_side_left: list, other_side_left: list) → None[source]

Match the leftover (unmapped) compounds and add the valid compound pairs to the CompoundHarmonizationManager. The invalid compound pairs are also added to the CompoundHarmonizationManager to avoid matching the same pair again.

Parameters
  • one_side_left – one list of leftover Compound entities.

  • other_side_left – the other list of leftover Compound entities.

Returns

None.

jaccard(one_compounds: list, other_compounds: list, mappings: dict) → float[source]

Calculate the Jaccard index between the two lists of compounds.

Parameters
  • one_compounds – one list of Compound entities.

  • other_compounds – the other list of Compound entities.

  • mappings – the dictionary of mapped compounds between the two compound lists.

Returns

the Jaccard index of the two compound lists.
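For reference, the plain Jaccard index is the size of the intersection divided by the size of the union. The sketch below treats each mapped compound pair as one shared element and works on counts only. It is an unweighted simplification: it does not model the compound occurrences tracked by the CompoundHarmonizationManager, and the name jaccard_index is hypothetical.

```python
def jaccard_index(one_count: int, other_count: int, mapped_pairs: int) -> float:
    """Unweighted Jaccard index: shared elements over the size of the union.

    Assumption: each mapped pair counts as one element shared by both lists.
    """
    union = one_count + other_count - mapped_pairs
    if union == 0:
        return 0.0  # two empty lists: define the index as 0 to avoid dividing by zero
    return mapped_pairs / union


score = jaccard_index(3, 4, 2)  # 2 shared out of 3 + 4 - 2 = 5 distinct compounds
```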

one_to_one_compound_mappings(mappings: dict) → Optional[tuple][source]

Find the one-to-one compound mappings between the two reactions. This step handles the rare cases in which a compound in one reaction can be mapped to two or more compounds in the other reaction.

Parameters

mappings – the dictionary of compound mappings.

Returns

the tuple of the relationships of the compound pairs and the dictionary of one-to-one compound mappings.
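Resolving a one-to-many mapping into a one-to-one mapping can be sketched with a small backtracking search. This illustrates the general idea only: the function name one_to_one is hypothetical, the mappings are assumed to map each one-side name to a collection of candidate other-side names, and the sketch returns only the chosen pairs, not the relationships.

```python
from typing import Optional


def one_to_one(mappings: dict) -> Optional[dict]:
    """Pick a distinct partner for every key, or return None if impossible."""
    keys = list(mappings)
    chosen = {}
    used = set()

    def assign(index: int) -> bool:
        if index == len(keys):
            return True  # every key has received a distinct partner
        key = keys[index]
        for partner in mappings[key]:
            if partner in used:
                continue
            used.add(partner)
            chosen[key] = partner
            if assign(index + 1):
                return True
            # Undo the choice and try the next candidate partner.
            used.discard(partner)
            del chosen[key]
        return False

    return chosen if assign(0) else None


resolved = one_to_one({"A": ["x", "y"], "B": ["x"]})
```

Here "A" first tries "x", which leaves nothing for "B", so the search backtracks and assigns "y" to "A" instead.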

md_harmonize.harmonization.harmonize_compound_list(compound_dict_list: list) → md_harmonize.harmonization.CompoundHarmonizationManager[source]

Harmonize compounds across different databases based on the compound coloring identifier.

Parameters

compound_dict_list – the list of Compound dictionaries from different sources.

Returns

the CompoundHarmonizationManager containing harmonized compound edges.
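The idea of pairing compounds from different sources by a shared identifier can be sketched as follows. The identifier strings here are placeholders rather than real coloring identifiers, each source is simplified to a dictionary of compound name to identifier, and the name pair_by_identifier is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pair_by_identifier(sources: list) -> list:
    """Pair compound names from different sources that share an identifier.

    Assumption: each source is a dict mapping a compound name to an
    identifier string that is equal for structurally identical compounds.
    """
    buckets = defaultdict(list)
    for source_index, compounds in enumerate(sources):
        for name, identifier in compounds.items():
            buckets[identifier].append((source_index, name))

    pairs = []
    for members in buckets.values():
        for (src_a, name_a), (src_b, name_b) in combinations(members, 2):
            if src_a != src_b:  # only pair compounds across different sources
                pairs.append((name_a, name_b))
    return pairs


sources = [{"C00001": "id1", "C00002": "id2"}, {"WATER": "id1"}]
pairs = pair_by_identifier(sources)
```

Grouping by identifier first avoids comparing every compound in one source with every compound in the others.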

md_harmonize.harmonization.harmonize_reaction_list(reaction_lists: list, compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager) → md_harmonize.harmonization.ReactionHarmonizationManager[source]

Harmonize reactions across different sources based on the harmonized compounds. At the same time, this also harmonizes compound pairs with the resonance, linear-circular, and r group edge types.

Parameters
  • reaction_lists – a list of Reaction lists from different sources.

  • compound_harmonization_manager – a CompoundHarmonizationManager containing harmonized compound pairs with the same structure.

Returns

the ReactionHarmonizationManager.