The md_harmonize API Reference¶

This package includes the following modules.

md_harmonize.compound¶

This module provides the Atom class, the Bond class, and the Compound class to construct a compound entity. Most of the instance variables of these three classes are based on CTFile fields.

class md_harmonize.compound.Atom(atom_symbol: str, atom_number: int, x: str = '0', y: str = '0', z: str = '0', mass_difference: str = '0', charge: str = '0', atom_stereo_parity: str = '0', hydrogen_count: str = '0', stereo_care_box: str = '0', valence: str = '0', h0designator: str = '0', atom_atom_mapping_number: str = '0', inversion_retention_flag: str = '0', exact_change_flag: str = '0', kat: str = '', in_cycle: bool = False)[source]¶

Atom class describes the Atom entity in the compound.

Atom initializer.

Parameters

atom_symbol – the atom symbol.
atom_number – the atom number.
x – the atom x coordinate.
y – the atom y coordinate.
z – the atom z coordinate.
mass_difference – the difference from the mass in the periodic table.
charge – the charge.
atom_stereo_parity – the atom stereo parity.
hydrogen_count – the hydrogen count.
stereo_care_box – the stereo care box.
valence – the valence.
h0designator – the H0 designator (obsolete CTFile parameter).
atom_atom_mapping_number – the atom-atom mapping number.
inversion_retention_flag – the inversion/retention flag.
exact_change_flag – the exact change flag.
kat – the KEGG atom type.
in_cycle – whether the atom is in a cycle.

update_symbol(symbol: str) → str[source]¶

To update the atom symbol.

Parameters: symbol – the updated atom symbol.
Returns: the updated atom_symbol.

update_atom_number(index: int) → int[source]¶

To update the atom number.

Parameters: index – the updated atom number.
Returns: the updated atom number.

remove_neighbors(neighbors: list) → list[source]¶

To remove neighbors from the atom.

Parameters: neighbors – the list of neighbors that will be removed from this atom.
Returns: the updated list of neighbors.

add_neighbors(neighbors: list) → list[source]¶

To add neighbors to the atom.

Parameters: neighbors – the list of neighbors that will be added to this atom.
Returns: the updated list of neighbors.

update_stereochemistry(stereo: str) → str[source]¶

To update the atom stereochemistry.

Parameters: stereo – the updated atom stereochemistry.
Returns: the updated atom stereochemistry.

color_atom(isotope_resolved: bool = False, charge: bool = False, atom_stereo: bool = False) → str[source]¶

To generate the atom color of the zero layer.

Parameters

isotope_resolved – If true, add isotope information when constructing colors.
charge – If true, add charge information when constructing colors.
atom_stereo – If true, add atom stereochemistry information when constructing colors.

Returns

the atom color of the zero layer.

reset_color() → None[source]¶

Reset the atom color.

Returns: None.

update_kat(kat: str) → str[source]¶

To update the atom KEGG atom type.

Parameters: kat – the KEGG atom type for this atom.
Returns: the updated KEGG atom type.

update_cycle(cycle_status: bool) → bool[source]¶

To update the cycle status of the atom.

Parameters: cycle_status – whether the atom is in a cycle.
Returns: the updated cycle status.

clone()[source]¶

To clone the atom.

Returns: the cloned atom.

class md_harmonize.compound.Bond(first_atom_number: str, second_atom_number: str, bond_type: str, bond_stereo: str = '0', bond_topology: str = '0', reacting_center_status: str = '0')[source]¶

Bond class describes the Bond entity in the compound.

Bond initializer.

Parameters

first_atom_number – the index of the first atom forming this bond.
second_atom_number – the index of the second atom forming this bond.
bond_type – the bond type. (1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any)
bond_stereo – the bond stereo. (Single bonds: 0 = not stereo, 1 = Up, 4 = Either, 6 = Down; Double bonds: determined by x, y, z coordinates)
bond_topology – the bond topology. (0 = Either, 1 = Ring, 2 = Chain)
reacting_center_status – reacting center status.

update_bond_type(bond_type: str) → str[source]¶

To update the bond type.

Parameters: bond_type – the updated bond type.
Returns: the updated bond type.

update_stereochemistry(stereo: str) → str[source]¶

To update the bond stereochemistry.

Parameters: stereo – the updated bond stereochemistry.
Returns: the updated bond stereochemistry.

update_first_atom(index: int) → int[source]¶

To update the first atom number of the bond.

Parameters: index – the updated first atom number.
Returns: the updated first atom number.

update_second_atom(index: int) → int[source]¶

To update the second atom number of the bond.

Parameters: index – the updated second atom number.
Returns: the updated second atom number.

clone()[source]¶

To clone the bond.

Returns: the cloned bond.

class md_harmonize.compound.Compound(compound_name: str, atoms: list, bonds: list)[source]¶

Compound class describes the Compound entity.

Compound initializer.

Parameters

compound_name – the compound name.
atoms – a list of Atom entities in the compound.
bonds – a list of Bond entities in the compound.

encode() → tuple[source]¶

To encode the compound as a tuple.

Returns: the tuple encoding of the compound.

property name¶

To get the compound name.

Returns: the compound name.

static molfile_name(molfile: str)[source]¶

Create the compound entity based on the molfile representation.

Parameters: molfile – the filename of the molfile.
Returns: the constructed compound entity.

property formula¶

To construct the formula of this compound (only consider heavy atoms).

Returns: string formula of the compound.

property composition¶

To get the atom symbols and bond types in the compound.

Returns: the atom and bond information of the compound

property r_groups¶

To get all the R groups in the compound.

Returns: the list of indices of all the R groups.

contains_r_groups() → bool[source]¶

To check if the compound contains R group(s).

Returns: bool whether the compound contains R group.

has_isolated_atoms() → bool[source]¶

To check if the compound has atoms that have no connections to other atoms.

Returns: bool whether the compound has isolated atoms.

property metal_index¶

To get the metal elements in the compound.

Returns: a list of atom numbers of metal elements.

property h_index¶

To get all H in the compound.

Returns: a list of atom numbers corresponding to H.

property heavy_atoms¶

To get all the heavy atoms in the compound.

Returns: a list of atom numbers corresponding to heavy atoms.

property index_of_heavy_atoms¶

To map the atom number to index in the heavy atom list.

Returns: the dictionary of atom number to atom index of heavy atoms.

color_groups(excluded=None) → dict[source]¶

To update the compound color groups after coloring.

Returns: the dictionary mapping each atom color to the list of atom numbers with that color.

detect_abnormal_atom() → dict[source]¶

To find the atoms with invalid bond counts.

Returns: a list of atom numbers with invalid bond counts.

curate_invalid_n() → None[source]¶

To curate the charge of invalid N atoms.

Returns: None.

update_aromatic_bond_type(cycles: list) → None[source]¶

Update the aromatic bond types. Two cases: 1) change the bond in the aromatic ring to aromatic bond (bond type = 4); 2) change the double bond connecting to the aromatic ring to single bond.

Parameters: cycles – the list of cycles represented by aromatic atom index.
Returns: None.
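
The two cases above can be illustrated with a minimal standalone sketch. It does not use the md_harmonize classes: the bond dictionary keyed by frozenset pairs, the string bond types, and the function name are assumptions made for this example only, and chords across a ring are not treated specially.

```python
def retype_aromatic_bonds(bonds, cycles):
    """Return a copy of `bonds` with aromatic cycles retyped.

    bonds:  dict mapping frozenset({atom_a, atom_b}) -> bond type string
            (hypothetical layout, not the library's data structure).
    cycles: list of aromatic cycles, each a list of atom numbers.
    """
    aromatic_atoms = set()
    aromatic_bonds = set()
    for cycle in cycles:
        ring = set(cycle)
        aromatic_atoms |= ring
        # A bond lies in the ring when both of its atoms are ring atoms.
        aromatic_bonds |= {pair for pair in bonds if pair <= ring}

    updated = {}
    for pair, bond_type in bonds.items():
        if pair in aromatic_bonds:
            updated[pair] = "4"          # case 1: ring bond becomes aromatic
        elif bond_type == "2" and pair & aromatic_atoms:
            updated[pair] = "1"          # case 2: outside double bond becomes single
        else:
            updated[pair] = bond_type
    return updated


# Six-membered ring (atoms 0-5) with an outside double bond from atom 0 to atom 6.
ring_bonds = {
    frozenset((0, 1)): "1", frozenset((1, 2)): "2", frozenset((2, 3)): "1",
    frozenset((3, 4)): "2", frozenset((4, 5)): "1", frozenset((5, 0)): "1",
    frozenset((0, 6)): "2",
}
retyped = retype_aromatic_bonds(ring_bonds, [[0, 1, 2, 3, 4, 5]])
```

Collecting all aromatic atoms and bonds before rewriting keeps the result independent of the order in which fused cycles are listed.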

extract_double_bond_connecting_cycle(atom_in_cycle: list) → list[source]¶

To extract the double bonds connecting to the atom in the aromatic cycles.

Parameters: atom_in_cycle – the list of aromatic cycles represented by aromatic atom index.
Returns: the list of outside double bond connecting to the atom in the aromatic cycles.

extract_aromatic_bonds(cycle: list) → list[source]¶

Extract the aromatic bonds based on the atoms in the cycle.

Parameters: cycle – the list of aromatic cycles represented by aromatic atom index.
Returns: the list of aromatic bonds.

separate_connected_components(bonds: Union[list, set]) → list[source]¶

This is used in constructing the aromatic substructures detected by the Indigo method. A compound can have several disjoint aromatic substructures, and here we need to find the disjoint parts. The basic idea is union-find: we union atoms that are connected by a bond.

Parameters: bonds – the list of bonds, each represented by the atom numbers forming the bond.
Returns: a list of separate components, each represented by a list of the atom numbers in the component.
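
The union-find idea can be sketched as follows. This is an illustration under assumptions, not the library's implementation: bonds are taken to be plain (atom_number, atom_number) pairs, and the function name is hypothetical.

```python
def separate_components(bonds):
    """Group atom numbers into disjoint components with union-find."""
    parent = {}

    def find(atom):
        # Register unseen atoms as their own root, then walk up to the root.
        parent.setdefault(atom, atom)
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]  # path halving
            atom = parent[atom]
        return atom

    # Union the two atoms of every bond.
    for first, second in bonds:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    # Collect the atoms under each root.
    groups = {}
    for atom in list(parent):
        groups.setdefault(find(atom), []).append(atom)
    return [sorted(group) for group in groups.values()]


components = separate_components([(0, 1), (1, 2), (5, 6)])
```

Here atoms 0, 1 and 2 end up in one component and atoms 5 and 6 in another.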

connected_components() → dict[source]¶

Detect the connected components in the compound structure (using breadth first search). This handles cases where not all the atoms are connected together.

Returns: the dictionary of the connected components.
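
A breadth first search over an adjacency dictionary is one way to picture this. The sketch below is standalone: the `neighbors` layout (atom number to list of neighboring atom numbers), the integer component keys, and the function name are assumptions for illustration.

```python
from collections import deque


def find_connected_components(neighbors):
    """Return {component id: [atom numbers]} using breadth first search."""
    seen = set()
    components = {}
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            atom = queue.popleft()
            members.append(atom)
            for nxt in neighbors[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components[len(components)] = members
    return components


# Atoms 0-1-2 form a chain; atom 3 is isolated.
example = find_connected_components({0: [1], 1: [0, 2], 2: [1], 3: []})
```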

calculate_distance_to_r_groups() → None[source]¶

To calculate the distance of each atom to its nearest R group (using Dijkstra’s algorithm).

Returns: None.
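
As a rough illustration, Dijkstra’s algorithm can be started from all R groups at once so that each atom receives the distance to its nearest one. The sketch assumes every bond has length 1 and returns the distances instead of storing them on the atoms; the adjacency layout and the function name are hypothetical.

```python
import heapq


def distance_to_r_groups(neighbors, r_groups):
    """Return {atom number: bond count to the nearest R group}."""
    distances = {atom: float("inf") for atom in neighbors}
    heap = []
    for r_group in r_groups:
        distances[r_group] = 0
        heapq.heappush(heap, (0, r_group))
    while heap:
        dist, atom = heapq.heappop(heap)
        if dist > distances[atom]:
            continue  # stale heap entry
        for nxt in neighbors[atom]:
            if dist + 1 < distances[nxt]:
                distances[nxt] = dist + 1
                heapq.heappush(heap, (dist + 1, nxt))
    return distances


# Chain R(0)-C(1)-C(2)-C(3)-R(4).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = distance_to_r_groups(chain, [0, 4])
```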

find_cycles(short_circuit: bool = False, cutoff: int = 40, seconds=50) → list[source]¶

To find the cycles in the compound.

Parameters

short_circuit – whether to take short path.
cutoff – limit of cycle length.
seconds – the timeout limit.

Returns

the list of cycles in the compound.

find_cycles_helper(short_circuit: bool = False, cutoff: int = 40) → list[source]¶

Executing function to find the cycles in the compound.

Parameters

short_circuit – whether to take short path.
cutoff – limit of cycle length.

Returns

the list of cycles in the compound.

structure_matrix(resonance: bool = False, backbone: bool = False) → numpy.ndarray[source]¶

To construct the graph structural matrix of this compound. matrix[i][j] = 0 means the two atoms are not directly connected. Any other integer represents the type of the bond connecting the two atoms.

Parameters

resonance – bool whether to ignore the difference between single and double bonds.
backbone – bool whether to ignore bond types. This is for parsing atoms mappings from KEGG RCLASS.

Returns

the constructed structure matrix for this compound.
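
The effect of the two flags can be sketched with nested lists in place of a numpy array. How the library encodes the collapsed bond types is not stated here, so mapping double bonds to 1 under resonance and all bonds to 1 under backbone is an assumption of this example, as are the function name and the (i, j, bond_type) input tuples.

```python
def build_structure_matrix(n_atoms, bonds, resonance=False, backbone=False):
    """Return an n_atoms x n_atoms matrix of bond types (0 = not bonded)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, bond_type in bonds:
        if backbone:
            value = 1                      # ignore bond types entirely
        elif resonance and bond_type == 2:
            value = 1                      # treat double bonds like single bonds
        else:
            value = bond_type
        matrix[i][j] = matrix[j][i] = value
    return matrix


bonds = [(0, 1, 1), (1, 2, 2)]
plain = build_structure_matrix(3, bonds)
merged = build_structure_matrix(3, bonds, resonance=True)
```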

property distance_matrix¶

To construct the distance matrix of the compound (using the Floyd-Warshall algorithm). distance[i][j] is the distance between atoms i and j.

Returns: the distance matrix of the compound.
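
For reference, a minimal Floyd-Warshall sketch over unit-length bonds looks like this. It is a standalone illustration with a hypothetical function name and input layout, using nested lists and zero-based atom indices.

```python
def build_distance_matrix(n_atoms, bonds):
    """Return all-pairs shortest path lengths, counting one unit per bond."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n_atoms)] for i in range(n_atoms)]
    for i, j in bonds:
        dist[i][j] = dist[j][i] = 1
    # Allow paths through intermediate atom k, one k at a time.
    for k in range(n_atoms):
        for i in range(n_atoms):
            for j in range(n_atoms):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Linear chain of four atoms: 0-1-2-3.
d = build_distance_matrix(4, [(0, 1), (1, 2), (2, 3)])
```

Pairs of atoms in different connected components keep an infinite distance in this sketch.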

update_color_tuple(resonance: bool = False) → None[source]¶

To update the color tuple of the atoms in the compound. This color tuple includes information of its neighboring atoms and bonds. Here, we don’t need to consider backbone since this part was initially designed for aromatic substructure detection and only double and single bonds are considered.

Parameters: resonance – bool whether to ignore the difference between single and double bonds.
Returns: None.

find_mappings(the_other, resonance: bool = True, r_distance: bool = False, backbone: bool = False) → list[source]¶

Find the one to one atom mappings between two compounds using the BASS algorithm. The other compound is supposed to be contained in the self compound.

Parameters

the_other – the other compound entity.
resonance – whether to ignore the difference between single and double bonds.
r_distance – whether to take account of the position of R groups.
backbone – whether to ignore the bond types.

Returns

the list of atom mappings in the heavy atom order.

find_mappings_reversed(the_other, resonance: bool = True, r_distance: bool = False, backbone: bool = False) → list[source]¶

Find the one to one atom mappings between two compounds using the BASS algorithm. The self compound is supposed to be contained in the other compound.

Parameters

the_other – the other compound entity.
resonance – whether to ignore the difference between single and double bonds.
r_distance – whether to take account of the position of R groups.
backbone – whether to ignore the bond types.

Returns

the list of atom mappings in the heavy atom order.

map_resonance(the_other, r_distance: bool = False, seconds: int = 50) → list[source]¶

Check if the resonant mappings are valid between two compound structures.

Parameters

the_other – the other compound entity.
r_distance – to take account of the position of R groups.
seconds – the timeout limit.

Returns

the list of valid atom mappings between the two compound structures.

map_resonance_helper(the_other, r_distance: bool = False) → list[source]¶

Check if the resonant mappings are valid between the two compound structures. If the mapped atoms don’t share the same local coloring identifier, we check if the difference is caused by the position of double bonds. Find the three atoms involved in the resonant structure and check if one of the atoms is not C. For example, the double bond at C (b) may point to N (c) in one structure and to N (a) in the other:

N (a) - C (b) = N (c)    versus    N (a) = C (b) - N (c)

In addition, the self compound is supposed to be more generic, which means it has fewer atoms. Therefore, atoms in the self compound can all be mapped to the other compound.

Parameters

the_other – the other compound entity.
r_distance – to take account of position of R groups.

Returns

the list of valid atom mappings between the two compound structures.

find_double_bond_linked_atom(i: int) → int[source]¶

Find the atom that is doubly linked to the target atom i.

Parameters: i – the ith atom in the compound.
Returns: the index of the doubly linked atom.

define_bond_stereochemistry() → None[source]¶

Define the stereochemistry of double bonds in the compound.

Returns: None.

calculate_bond_stereochemistry(bond: md_harmonize.compound.Bond) → int[source]¶

Calculate the stereochemistry of the double bond based on its geometric properties. The line of the double bond divides the plane into two parts. Each atom forming the double bond normally has two branches. If the two branches are not the same, we call them the heavy side and the light side (the heavy side contains atoms with heavier atomic weights). We determine the bond stereochemistry by checking if the two heavy sides lie on the same part of the divided plane.

Parameters: bond – the bond entity.
Returns: the calculated bond stereochemistry.

static calculate_y_coordinate(slope: float, b: float, atom: md_harmonize.compound.Atom) → float[source]¶

Calculate the y coordinate of the atom based on the linear function: y = slope * x + b

Parameters

slope – the slope of the targeted line.
b – the intercept of the targeted line.
atom – the atom entity.

Returns

the calculated y coordinate.
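
The linear function y = slope * x + b is enough to decide whether two heavy-side atoms lie on the same part of the plane divided by the double bond. The sketch below works on plain (x, y) tuples instead of Atom entities, assumes the bond line is not vertical, and uses hypothetical function names; it only reports "same part or not" and does not reproduce the integer codes returned by the library.

```python
def line_y(slope, b, x):
    """y value of the line y = slope * x + b at position x."""
    return slope * x + b


def heavy_sides_on_same_part(first, second, heavy_first, heavy_second):
    """first/second: coordinates of the two atoms forming the double bond.
    heavy_first/heavy_second: coordinates of the heavy-side atom on each end.
    """
    slope = (second[1] - first[1]) / (second[0] - first[0])
    b = first[1] - slope * first[0]
    above_first = heavy_first[1] > line_y(slope, b, heavy_first[0])
    above_second = heavy_second[1] > line_y(slope, b, heavy_second[0])
    return above_first == above_second


# Double bond along the x axis from (0, 0) to (1, 0).
same_part = heavy_sides_on_same_part((0.0, 0.0), (1.0, 0.0), (-0.5, 1.0), (1.5, 1.0))
other_part = heavy_sides_on_same_part((0.0, 0.0), (1.0, 0.0), (-0.5, 1.0), (1.5, -1.0))
```

A vertical bond line would need separate handling (comparing x coordinates), which this sketch leaves out.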

collect_atomic_weights_of_neighbors(neighbors: list) → list[source]¶

To collect the atomic weights of the current layer’s neighbors.

Parameters: neighbors – the list of atom numbers of neighbors.
Returns: the list of atomic weights for this layer’s neighbors.

compare_branch_weights(neighbors: list, atom_forming_double_bond: md_harmonize.compound.Atom) → tuple[source]¶

To determine the heavy and light branches that connect to the atom forming the double bond. This is based on comparison of the atomic weights of the two branches (breadth first algorithm).

Parameters

neighbors – the list of atom numbers of the atoms that connect the atom forming the double bond.
atom_forming_double_bond – the atom that forms the bond.

Returns

heavy and light branches. [heavy_side, light_side]

get_next_layer_neighbors(cur_layer_neighbors: list, visited: set, excluded: list = None) → list[source]¶

To get the next layer’s neighbors.

Parameters

cur_layer_neighbors – the list of atom numbers of the current layer.
visited – the atom numbers that have already been visited.
excluded – the list of atom numbers that should not be included in the next layer.

Returns

the neighboring atom numbers of the next layer.

color_compound(r_groups: bool = True, bond_stereo: bool = False, atom_stereo: bool = False, resonance: bool = False, isotope_resolved: bool = False, charge: bool = False, backbone: bool = False) → None[source]¶

To color the compound.

Parameters

r_groups – If true, add R groups in the coloring.
bond_stereo – If true, add bond stereo detail when constructing colors.
atom_stereo – If true, add atom stereo detail when constructing colors.
resonance – If true, ignore the difference between double and single bonds.
isotope_resolved – If true, add isotope detail when constructing colors.
charge – If true, add charge detail when constructing colors.
backbone – If true, ignore bond types in the coloring.

Returns

None.

reset_color() → None[source]¶

To set the color of atoms in the compound to be empty.

Returns: None.

generate_atom_zero_layer_color(isotope_resolved: bool = False, charge: bool = False, atom_stereo: bool = False) → None[source]¶

To generate the color identifier of zero layer for each atom. We don’t consider H and metals here.

Parameters

isotope_resolved – If true, add isotope detail when constructing colors.
charge – If true, add charge detail when constructing colors.
atom_stereo – If true, add atom stereochemistry detail when constructing colors.

Returns

None.

generate_atom_color_with_neighbors(atom_index: list, excluded: list = None, zero_core_color: bool = True, zero_neighbor_color: bool = True, resonance: bool = False, bond_stereo: bool = False, backbone: bool = False) → dict[source]¶

To generate the atom color with its neighbors. We add this color name when we try to incorporate neighbors’ information in naming.

Here, we don’t need to care about the atom stereo. It has been taken care of in generating color_0.

Basic color formula: atom.color + [neighbor.color + bond.bond_type]

Parameters

atom_index – indices of atoms to color.
excluded – the list of atom indices that will be excluded from coloring.
zero_core_color – If true, we use the atom.color_0 else atom.color for the core atom (first round coloring vs validation).
zero_neighbor_color – If true, we use the atom.color_0 else atom.color for the neighbor atoms (first round coloring vs validation).
resonance – If true, detect resonant compound pairs without distinguishing between double and single bonds.
bond_stereo – If true, add stereo detail of bonds when constructing colors.
backbone – If true, ignore bond types in the coloring.

Returns

the dictionary of atom index and its color name.
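
The basic color formula can be pictured with a small standalone sketch. The dictionaries, the sorting of neighbor parts (added so the result does not depend on neighbor order), and the way resonance and backbone alter the bond label are all assumptions for illustration, not the library's exact string format.

```python
def color_with_neighbors(atom, colors, neighbors, bond_types,
                         resonance=False, backbone=False):
    """Build atom.color + [neighbor.color + bond.bond_type] for one atom."""
    parts = []
    for nxt in neighbors[atom]:
        bond_type = bond_types[frozenset((atom, nxt))]
        if backbone:
            label = ""                       # ignore bond types
        elif resonance and bond_type == "2":
            label = "1"                      # double treated like single
        else:
            label = bond_type
        parts.append(colors[nxt] + label)
    return colors[atom] + "".join(sorted(parts))


# C(0)-C(1)=O(2): color the middle carbon.
colors = {0: "C", 1: "C", 2: "O"}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
bond_types = {frozenset((0, 1)): "1", frozenset((1, 2)): "2"}
full = color_with_neighbors(1, colors, neighbors, bond_types)
resonant = color_with_neighbors(1, colors, neighbors, bond_types, resonance=True)
skeleton = color_with_neighbors(1, colors, neighbors, bond_types, backbone=True)
```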

first_round_color(atoms_to_color: list, excluded_index: list = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False, depth: int = 5000) → None[source]¶

To do the first round of coloring this compound. We add neighbors’ information layer by layer to the atom’s color identifier until it has a unique identifier or all the atoms in the compound have been used for naming (based on the breadth first search algorithm).

Parameters

atoms_to_color – the list of atom numbers to be colored.
excluded_index – the list of atom numbers to be excluded from coloring.
bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore the difference between double and single bonds.
backbone – If true, ignore bond types in the coloring.
depth – the max depth of coloring.

Returns

None.
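
The layer-by-layer idea can be sketched as follows. This simplified version only appends the zero-layer colors of each new layer and ignores bond types and stereo details, so it is an illustration of the breadth first expansion rather than the library's coloring; the function name and the dictionary inputs are hypothetical.

```python
def layered_colors(zero_colors, neighbors, depth=5000):
    """Extend each atom's color one neighbor layer at a time until it is
    unique or no unvisited atoms remain for that atom."""
    colors = dict(zero_colors)
    visited = {atom: {atom} for atom in neighbors}
    frontier = {atom: [atom] for atom in neighbors}
    for _ in range(depth):
        counts = {}
        for color in colors.values():
            counts[color] = counts.get(color, 0) + 1
        # Atoms whose color is shared and that still have a layer to expand.
        pending = [a for a in neighbors if counts[colors[a]] > 1 and frontier[a]]
        if not pending:
            break
        for atom in pending:
            layer = []
            for cur in frontier[atom]:
                for nxt in neighbors[cur]:
                    if nxt not in visited[atom]:
                        visited[atom].add(nxt)
                        layer.append(nxt)
            frontier[atom] = layer
            if layer:
                colors[atom] += "(" + "".join(sorted(zero_colors[n] for n in layer)) + ")"
    return colors


# Chain C(0)-C(1)-C(2)-O(3): one extra layer separates the three carbons.
chain_colors = layered_colors(
    {0: "C", 1: "C", 2: "C", 3: "O"},
    {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]},
)
```

In a symmetric structure such as a three-carbon chain, the two end atoms keep identical colors after all layers are used, which is the situation the symmetry checks that follow are meant to examine.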

invalid_symmetric_atoms(atoms_to_color: list, excluded_index: bool = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False) → list[source]¶

To check if atoms with the same color identifier are symmetric.

Parameters

atoms_to_color – the list of atom numbers to be colored.
excluded_index – the list of atom numbers to be excluded from coloring.
bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore the difference between double and single bonds.
backbone – If true, ignore bond types in the coloring.

Returns

the list of atom numbers to be recolored.

curate_invalid_symmetric_atoms(atoms_to_color: list, excluded_index: list = None, bond_stereo: bool = False, resonance: bool = False, backbone: bool = False) → None[source]¶

To curate the color identifiers of invalid symmetric atoms. We recolor those invalid atoms using the full color identifiers of their neighbors layer by layer until the difference can be captured.

Parameters

atoms_to_color – the list of atom numbers of atoms to be colored.
excluded_index – the list of atom numbers of atoms to be excluded from coloring.
bond_stereo – If true, add stereo information to bonds when constructing colors.
resonance – If true, ignore the difference between double bonds and single bonds.
backbone – If true, ignore bond types in the coloring.

Returns

None.

color_metal(bond_stereo: bool = False, resonance: bool = True, backbone: bool = False) → None[source]¶

To color the metals in the compound. Here we just incorporate information of directly connected atoms.

Parameters

bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore difference between double and single bonds.
backbone – If true, ignore the bond types.

Returns

None.

color_h(bond_stereo: bool = False, resonance: bool = True, backbone: bool = False) → None[source]¶

To color the H in the compound. Here we just incorporate information of directly connected atoms.

Parameters

bond_stereo – If true, add bond stereo detail when constructing colors.
resonance – If true, ignore difference between double and single bonds.
backbone – If true, ignore bond types.

Returns

None.

metal_color_identifier(details: bool = True) → str[source]¶

To generate the metal coloring string representation.

Parameters: details – if true, to use full metal color when constructing identifier.
Returns: the metal coloring string representation.

h_color_identifier(details: bool = True) → str[source]¶

To generate the H coloring string representation.

Parameters: details – if true, use the full H color when constructing identifier.
Returns: the H coloring string representation.

backbone_color_identifier(r_groups: bool = False) → str[source]¶

To generate the backbone coloring string representation for this compound. Exclude Hs and metals.

Parameters: r_groups – whether to include the R group.
Returns: the coloring string representation for this compound.

get_chemical_details(excluded: list = None) → list[source]¶

To get the chemical details of the compound, which include the atom stereochemistry and bond stereochemistry. This is used to compare compounds with the same structure (or the same color identifiers).

Parameters: excluded – a list of atom indices to be ignored.
Returns: the list of chemical details in the compound.

static compare_chemical_details(one_chemical_details: list, the_other_chemical_details: list) → tuple[source]¶

To compare the chemical details of the two compounds and return the relationship between them.

The relationship can be equivalent, generic-specific, or loose, represented by 0, (-1, 1), and 2, respectively.

Parameters

one_chemical_details – the chemical details of one compound.
the_other_chemical_details – the chemical details of the other compound.

Returns

the relationship between the two structures and the count of chemical details that cannot be mapped.
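
One way to picture this comparison is a multiset difference between the two lists of details. The sketch below is an assumption-laden illustration: details are taken to be hashable values, and the sign convention (1 when the first compound is more generic, -1 when the second one is) is assumed for the example rather than confirmed by this reference. The function name is hypothetical.

```python
from collections import Counter


def compare_details(one_details, other_details):
    """Return (relationship, count of details that cannot be mapped)."""
    one, other = Counter(one_details), Counter(other_details)
    only_one = sum((one - other).values())      # details missing from the other
    only_other = sum((other - one).values())    # details missing from the first
    if only_one == 0 and only_other == 0:
        return 0, 0                             # equivalent
    if only_one == 0:
        return 1, only_other                    # first is more generic (assumed sign)
    if only_other == 0:
        return -1, only_one                     # second is more generic (assumed sign)
    return 2, only_one + only_other             # loose


equivalent = compare_details(["a", "b"], ["a", "b"])
generic = compare_details(["a"], ["a", "b"])
loose = compare_details(["a", "c"], ["a", "b"])
```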

same_structure_relationship(the_other_compound) → tuple[source]¶

To determine the relationship of two compounds with the same structure.

Parameters: the_other_compound – the other Compound entity.
Returns: the relationship and the atom mappings between the two compounds.

generate_atom_mapping_by_atom_color(the_other_compound) → dict[source]¶

To generate the atom mappings between the two compounds.

Assume the two compounds have the same structure, so we can achieve atom mappings through atom colors.

Parameters: the_other_compound – the other Compound entity.
Returns: the atom mappings between the two compounds.

optimal_resonant_mapping(the_other_compound, mappings: list) → tuple[source]¶

To find the optimal atom mappings for compound pairs that are resonant type.

Parameters

the_other_compound – the other Compound entity.
mappings – the list of atom mappings between the two compounds detected by BASS.

Returns

the relationship and the atom mappings between the two compounds.

static determine_relationship(unmapped_count: dict) → int[source]¶

To determine the relationship between two compounds when there are multiple possible atom mappings.

We try to map as many details as possible.

0: equivalent; 1: self is more generic than the other compound; -1: the other compound is more generic than self; 2: each compound has chemical detail(s) that the other compound does not have.

Parameters: unmapped_count – the dictionary of relationship to the count of details that cannot be mapped.
Returns: the relationship between the two compounds.

circular_pair_relationship(other_compound, seconds: int = 50) → tuple[source]¶

To determine the relationship of two compounds with interchangeable circular and linear representations with time limit.

Parameters

other_compound – the other Compound entity.
seconds – the timeout limit.

Returns

the relationship and the atom mappings between the two compounds.

circular_pair_relationship_helper(other_compound) → tuple[source]¶

To determine the relationship of two compounds with interchangeable circular and linear representations. We first find the critical atoms that are involved in the formation of the ring. There can be several possibilities. Then we break the ring and restore the double bond in the aldehyde group that forms the ring. Finally, we check if the updated structure is the same as the other compound, determine the relationship between the two compounds, and generate the atom mappings.

Parameters: other_compound – the other Compound entity.
Returns: the relationship and the atom mappings between the two compounds.

break_cycle(critical_atoms: int) → None[source]¶

To break the cycle caused by the aldol reaction, which often occurs in sugars. Two steps are involved: 1) remove the neighbors; 2) restore the double bond in the aldehyde group.

Parameters: critical_atoms – the three critical atoms that are involved in the ring formation.
Returns: None.

restore_cycle(critical_atoms: list) → None[source]¶

To restore the ring caused by the aldol reaction. This is the reverse process of break_cycle.

Parameters: critical_atoms – the three atoms that are involved in the aldol reaction.
Returns: None.

find_critical_atom_in_cycle() → list[source]¶

To find the C (atom_c) and O (atom_oo) in the aldehyde group, as well as the O (atom_o) in the hydroxy group, that are involved in the ring formation. We need to break the bond between atom_c and atom_o to obtain the linear form. Please check an example of the aldol reaction in a sugar if the description is confusing.

Returns: the list of critical atoms.

update_atom_symbol(index: list, updated_symbol: str) → None[source]¶

To update the atom symbols. This is often used to remove/restore R group.

Parameters

index – the indices of the atoms whose symbols are to be updated.
updated_symbol – the updated symbol.

Returns

None.

validate_mapping_with_r(other_compound, one_rs: list, mapping: dict) → bool[source]¶

To validate the atom mappings with R groups. There are two things we need to pay attention to:

For the generic compound, the R group can be mapped to a branch or just H in the specific compound.
For the specific compound, every unmatched branch needs to correspond to an R group in the generic compound.

In other words, the generic compound can have extra R groups that have no matched branch, but the specific compound cannot have unmatched branches that don’t correspond to any R groups.

For the specific validation:

1) We find all the linkages between an R group and a mapped atom in the compound, represented by the corresponding atom number in the other compound and the bond type. (We use the corresponding atom number in the other compound so that the R linkages of the two compounds can be compared in the next step.)

2) For every mapped atom in the other compound, we check whether it has neighbors that are not mapped. If so, the atom should be linked to an R group. We represent the linkage by the atom number and the bond type.

3) Based on the above validation criteria, we have to make sure that the R linkages in the other compound are a subset of the R linkages in this compound.

Parameters

other_compound – the other Compound entity.
one_rs – the R groups in the compound.
mapping – the atom mappings between the mapped parts of the two compounds.

Returns

bool whether the atom mappings are valid.
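
The three validation steps can be sketched with sets of (atom number, bond type) linkages. Everything here is a simplified stand-in for illustration: the adjacency and bond-type dictionaries, the function names, and the mapping from this compound's atom numbers to the other compound's atom numbers are assumptions, not the library's data structures.

```python
def r_linkages(r_groups, neighbors, bond_types, mapping):
    """Step 1: linkages of R groups in this compound, expressed with the
    atom numbers of the other compound."""
    links = set()
    for r_group in r_groups:
        for nxt in neighbors[r_group]:
            if nxt in mapping:
                links.add((mapping[nxt], bond_types[frozenset((r_group, nxt))]))
    return links


def unmatched_linkages(other_neighbors, other_bond_types, mapping):
    """Step 2: linkages from mapped atoms to unmapped neighbors in the other compound."""
    mapped = set(mapping.values())
    links = set()
    for atom in mapped:
        for nxt in other_neighbors[atom]:
            if nxt not in mapped:
                links.add((atom, other_bond_types[frozenset((atom, nxt))]))
    return links


def mapping_is_valid(this_links, other_links):
    """Step 3: every unmatched branch must correspond to an R linkage."""
    return other_links <= this_links


# Generic compound: C(0)-R(1). Specific compound: C(0)-C(1). Atom 0 maps to atom 0.
mapping = {0: 0}
generic_links = r_linkages([1], {0: [1], 1: [0]}, {frozenset((0, 1)): "1"}, mapping)
single_branch = unmatched_linkages({0: [1], 1: [0]}, {frozenset((0, 1)): "1"}, mapping)
double_branch = unmatched_linkages({0: [1], 1: [0]}, {frozenset((0, 1)): "2"}, mapping)
valid = mapping_is_valid(generic_links, single_branch)
invalid = mapping_is_valid(generic_links, double_branch)
```

An R group with no matching branch leaves an extra entry in the first set only, which the subset test accepts, in line with the rule that the generic compound may carry extra R groups.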

compare_chemical_details_with_mapping(other_compound, mapping: dict) → tuple[source]¶

To compare the chemical details of mapped atoms of the two compounds. This part targets compound pairs with resonance or r_group type. Only part of the chemical details needs to be checked: 1) atoms that are not involved in the resonance part or connected to R groups (both cases can be tested by the first layer atom coloring identifier); 2) bonds that are formed by the atoms described above.

Parameters

other_compound – the other Compound entity.
mapping – the mapped atoms between the two compounds.

Returns

the count of chemical details that cannot be mapped.

optimal_mapping_with_r(other_compound, one_rs: list, mappings: list) → tuple[source]¶

To find the optimal mappings of compound pairs belonging to the r_group type. In this case, multiple valid mappings can exist, and we need to find the one with the fewest unmapped chemical details. The unmapped chemical details can exist in both compounds (generic or specific) and determine the relationship of the compound pair. The priority is generic-specific, then loose. The relationship cannot be equivalent.

Parameters

other_compound – the other Compound entity.
one_rs – the list of R groups in the compound.
mappings – the atom mappings of the mapped parts in the two compounds.

Returns

the relationship and atom mappings between the two compounds.

with_r_pair_relationship(other_compound, seconds: int = 50) → tuple[source]¶

To find the relationship and the atom mappings between the two compounds that have r_groups type with a time limit.

Parameters

other_compound – the other Compound entity.
seconds – the timeout limit.

Returns

the relationship and the atom mappings between the two compounds.

with_r_pair_relationship_helper(other_compound) → tuple[source]¶

To find the relationship and the atom mappings between the two compounds that have r_groups type. Several steps are involved:

1) Ignore the R groups in the two compounds and find if one compound (generic compound) is included in the other compound (specific compound).

2) If we can find the mappings, then we need to validate the mappings with the validate_mapping_with_r function.

3) Then we get the optimal atom mappings of the mapped parts.

4) We need to map the unmatched branches in the specific compound to the corresponding R group in the generic compound.

Parameters: other_compound – the other Compound entity.
Returns: the relationship and the atom mappings between the two compounds.

map_r_correspondents(one_rs: list, other_compound, mappings: dict) → dict[source]¶

To map the unmatched branches in the specific compound to the corresponding R group in the generic compound.

Parameters

one_rs – the list of R groups in the compound.
other_compound – the other Compound entity.
mappings – the atom mappings of the mapped parts in the two compounds.

Returns

the full atom mappings between the two compounds.

md_harmonize.reaction¶

This module provides the Reaction class entity.

class md_harmonize.reaction.Reaction(reaction_name: str, one_side: list, other_side: list, ecs: dict, atom_mappings: list, coefficients: dict)[source]¶

Reaction class describes the Reaction entity.

Reaction initializer.

Parameters

reaction_name – the reaction name.
one_side – the list of Compound entities in one side of the reaction.
other_side – the list of Compound entities in the other side of the reaction.
ecs – the dict of Enzyme Commission numbers (EC numbers) of the reaction.
atom_mappings – the list of atom mappings between two sides of the reaction.
coefficients – the dictionary of compound names and their corresponding coefficients in the reaction.

__init__(reaction_name: str, one_side: list, other_side: list, ecs: dict, atom_mappings: list, coefficients: dict) → None[source]¶

Reaction initializer.

Parameters

reaction_name – the reaction name.
one_side – the list of Compound entities in one side of the reaction.
other_side – the list of Compound entities in the other side of the reaction.
ecs – the dict of Enzyme Commission numbers (EC numbers) of the reaction.
atom_mappings – the list of atom mappings between two sides of the reaction.
coefficients – the dictionary of compound names and their corresponding coefficients in the reaction.

property name¶

To get the reaction name.

Returns: the reaction name.

md_harmonize.KEGG_database_scraper¶

This module provides functions to download KEGG data (including compound, reaction, kcf, and rclass) from the KEGG (REST) API.

The URLs can change.

md_harmonize.KEGG_database_scraper.entry_list(target_url: str) → list[source]¶

To get the list of entry names to download.

Parameters: target_url – the url to fetch.
Returns: the list of entry names.

md_harmonize.KEGG_database_scraper.update_entity(entries: list, sub_directory: str, directory: str, suffix: str = '') → None[source]¶

To download the KEGG entity (compound, reaction, or rclass) and save it into a file.

Parameters

entries – the list of entry names to download.
sub_directory – the subdirectory to save the downloaded file.
directory – the main directory to save the downloaded file.
suffix – the suffix needed for download, like the mol for compound molfile and kcf for compound kcf file.

Returns

None.

md_harmonize.KEGG_database_scraper.curate_molfile(file_path: str) → None[source]¶

To curate the molfile representation.

Parameters: file_path – the path to the molfile.
Returns: None.

md_harmonize.KEGG_database_scraper.download(directory: str) → None[source]¶

To download all the KEGG required files.

Parameters: directory – the directory to store the data.
Returns: None.

md_harmonize.KEGG_parser¶

This module provides functions to parse KEGG data (including compound, reaction, kcf, and rclass).

md_harmonize.KEGG_parser.kegg_data_parser(data: list) → dict[source]¶

This is to parse KEGG data (reaction, rclass, compound) file to a dictionary.

eg:

ENTRY R00259 Reaction

NAME acetyl-CoA:L-glutamate N-acetyltransferase

DEFINITION Acetyl-CoA + L-Glutamate <=> CoA + N-Acetyl-L-glutamate

EQUATION C00024 + C00025 <=> C00010 + C00624

RCLASS RC00004 C00010_C00024

RC00064 C00025_C00624

ENZYME 2.3.1.1

PATHWAY rn00220 Arginine biosynthesis

rn01100 Metabolic pathways

rn01110 Biosynthesis of secondary metabolites

rn01210 2-Oxocarboxylic acid metabolism

rn01230 Biosynthesis of amino acids

MODULE M00028 Ornithine biosynthesis, glutamate => ornithine

M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine

ORTHOLOGY K00618 amino-acid N-acetyltransferase [EC:2.3.1.1]

K00619 amino-acid N-acetyltransferase [EC:2.3.1.1]

K00620 glutamate N-acetyltransferase / amino-acid N-acetyltransferase [EC:2.3.1.35 2.3.1.1]

K11067 N-acetylglutamate synthase [EC:2.3.1.1]

K14681 argininosuccinate lyase / amino-acid N-acetyltransferase [EC:4.3.2.1 2.3.1.1]

K14682 amino-acid N-acetyltransferase [EC:2.3.1.1]

K22476 N-acetylglutamate synthase [EC:2.3.1.1]

K22477 N-acetylglutamate synthase [EC:2.3.1.1]

K22478 bifunctional N-acetylglutamate synthase/kinase [EC:2.3.1.1 2.7.2.8]

DBLINKS RHEA: 24295

///

Parameters: data – the KEGG reaction description.
Returns: the dictionary of parsed KEGG data.
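
A minimal sketch of how a record like the one above could be grouped by field name. It assumes the usual KEGG flat-file layout, where the field name occupies the first 12 columns and continuation lines leave those columns blank. The names parse_kegg_flat_file and _row are hypothetical and this is not the library's implementation.

```python
from collections import defaultdict

KEY_WIDTH = 12  # assumed width of the field-name column in KEGG flat files


def parse_kegg_flat_file(lines):
    """Group the lines of one KEGG record by field name (hypothetical helper)."""
    record = defaultdict(list)
    key = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("///"):  # end-of-record marker
            break
        if not line.strip():
            continue
        field = line[:KEY_WIDTH].strip()
        if field:  # a new field starts; otherwise this is a continuation line
            key = field
        if key is None:
            continue
        record[key].append(line[KEY_WIDTH:].strip())
    return dict(record)


def _row(field, value):
    """Build one fixed-width line for the sample record below."""
    return field.ljust(KEY_WIDTH) + value


sample = [
    _row("ENTRY", "R00259 Reaction"),
    _row("EQUATION", "C00024 + C00025 <=> C00010 + C00624"),
    _row("RCLASS", "RC00004 C00010_C00024"),
    _row("", "RC00064 C00025_C00624"),
    _row("ENZYME", "2.3.1.1"),
    "///",
]
parsed = parse_kegg_flat_file(sample)
```

Each field maps to a list so that multi-line fields such as RCLASS keep one entry per line.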

md_harmonize.KEGG_parser.parse_equation(equation: str) → tuple[source]¶

This is to parse the KEGG reaction equation.

eg: C00029 + C00001 + 2 C00003 <=> C00167 + 2 C00004 + 2 C00080

Parameters: equation – the equation string.
Returns: the parsed KEGG reaction equation.
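
A rough sketch of splitting the example equation into per-side coefficients. The helper name parse_kegg_equation and the returned pair of dictionaries are assumptions for illustration; the structure of the tuple returned by the documented function is not specified here.

```python
def parse_kegg_equation(equation):
    """Split a KEGG equation into two {compound: coefficient} dicts (hypothetical helper)."""

    def parse_side(side):
        coefficients = {}
        for term in side.split(" + "):
            tokens = term.split()
            if len(tokens) == 2:  # "2 C00003": explicit coefficient
                coefficient, compound = tokens
            else:  # "C00029": implicit coefficient of 1
                coefficient, compound = "1", tokens[0]
            coefficients[compound] = coefficient
        return coefficients

    left, right = equation.split(" <=> ")
    return parse_side(left), parse_side(right)


left, right = parse_kegg_equation(
    "C00029 + C00001 + 2 C00003 <=> C00167 + 2 C00004 + 2 C00080"
)
```

Coefficients are kept as strings in this sketch so that any non-numeric coefficient would pass through unchanged.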

md_harmonize.KEGG_parser.kegg_kcf_parser(kcf: list) → dict[source]¶

This is to parse KEGG kcf file to a dictionary.

eg:

ENTRY C00013 Compound

ATOM 9

1 P1b P 22.2269 -20.0662

2 O2c O 23.5190 -20.0779

3 O1c O 21.0165 -20.0779

4 O1c O 22.2851 -21.4754

5 O1c O 22.2617 -18.4642

6 P1b P 24.8933 -20.0837

7 O1c O 24.9401 -21.4811

8 O1c O 26.1797 -20.0662

9 O1c O 24.9107 -18.4582

BOND 8

1 1 2 1

2 1 3 1

3 1 4 1

4 1 5 2

5 2 6 1

6 6 7 1

7 6 8 1

8 6 9 2

///

Parameters: kcf – the kcf text.
Returns: the dictionary of parsed kcf file.
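
As an illustration only, the ATOM and BOND sections of the example above could be collected as follows. The helper parse_kcf_sections is hypothetical and works on whitespace-separated tokens; the keys and value types of the dictionary returned by the documented parser may differ.

```python
def parse_kcf_sections(lines):
    """Collect ATOM and BOND rows from KCF text (hypothetical helper)."""
    parsed = {"ATOM": [], "BOND": []}
    section = None
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "///":
            continue
        if tokens[0] in parsed:  # "ATOM 9" or "BOND 8" opens a section
            section = tokens[0]
        elif tokens[0].isdigit() and section:  # numbered row inside a section
            parsed[section].append(tokens)
        else:  # any other keyword closes the current section
            section = None
    return parsed


kcf = [
    "ENTRY C00013 Compound",
    "ATOM 2",
    "1 P1b P 22.2269 -20.0662",
    "2 O2c O 23.5190 -20.0779",
    "BOND 1",
    "1 1 2 1",
    "///",
]
sections = parse_kcf_sections(kcf)
```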

class md_harmonize.KEGG_parser.reaction_center(i, kat, label, match, difference)¶

Create new instance of reaction_center(i, kat, label, match, difference)

__getnewargs__()¶: Return self as a plain tuple. Used by copy and pickle.

static __new__(_cls, i, kat, label, match, difference)¶: Create new instance of reaction_center(i, kat, label, match, difference)

__repr__()¶: Return a nicely formatted representation string

difference¶: Alias for field number 4

i¶: Alias for field number 0

kat¶: Alias for field number 1

label¶: Alias for field number 2

match¶: Alias for field number 3
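
The field order listed above corresponds to a plain collections.namedtuple. A small sketch, with made-up field values (the meaning of label is not described here, so the value is a placeholder):

```python
from collections import namedtuple

# Same field order as in the documented signature.
reaction_center = namedtuple(
    "reaction_center", ["i", "kat", "label", "match", "difference"]
)

# All values below are illustrative placeholders.
center = reaction_center(
    i=0, kat="C8x", label="placeholder", match=["N5y", "S2x"], difference=[]
)
```

Fields can be read by name (center.kat) or by position (center[1]).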

class md_harmonize.KEGG_parser.RpairParser(rclass_name: str, rclass_definitions: list, one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound)[source]¶

This is to get one-to-one atom mappings between two compounds based on the rclass definition.

Several steps are involved in this process:

1. The rclass definition can have several pieces. Each piece describes a center atom (R) and its connected atoms. The connected atoms can stay the same (M) or change (D) between the two compound structures.

2. First, we need to find the center atoms based on the rclass descriptions.

3. For each center atom, there can be multiple candidates. In other words, several atoms in the compound can meet the RDM description. (Symmetric compounds are one simple case.)

4. Therefore, we need to generate all the combinations of the center atoms in a compound.

eg: if there are three atom centers, each center has several candidates:

center 1: [0, 1, 2]; center 2: [5, 6]; center 3: [10, 11]

The combinations for the center atoms:

[0, 5, 10], [0, 5, 11], [0, 6, 10], [0, 6, 11], [1, 5, 10], [1, 5, 11], [1, 6, 10], [1, 6, 11], [2, 5, 10], [2, 5, 11], [2, 6, 10], [2, 6, 11]

5. Next, we need to find the one-to-one atom mappings between the two compounds based on the mapped center atoms.

6. To solve this, we first disassemble each compound into separate components. The components arise from the difference atoms in the two compounds, i.e. broken bonds.

7. Then we need to find the mappings between the disassembled components, and concatenate the mappings of all the components.

8. To find the one-to-one atom mappings, we use the BASS algorithm. We assume the mapped components have the same structure since we have already removed the different parts. However, here we only map the backbone of the structure (in other words, we simplify all bond types to 1) because bond types can change (double bond to single bond, or triple bond to single bond).

9. To ensure the optimal mappings, we count the mapped atoms with changed local environment and choose the mapping with minimal changes.

RpairParser initializer.

Parameters

rclass_name – the rclass name.
rclass_definitions – a list of rclass definitions.
one_compound – one compound involved in the pair.
other_compound – the other compound involved in the pair.

__init__(rclass_name: str, rclass_definitions: list, one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound)[source]¶

RpairParser initializer.

Parameters

rclass_name – the rclass name.
rclass_definitions – a list of rclass definitions.
one_compound – one compound involved in the pair.
other_compound – the other compound involved in the pair.

map_atom_by_colors() → dict[source]¶

Roughly map the atoms between the two compounds by the atom color.

Returns: the dict of mapped atom index between the two compounds.

map_whole_compound() → dict[source]¶

Map two compounds if the two compounds can be roughly mapped by the atom color.

Returns: the dict of mapped atom in the two compounds.

static generate_kat_neighbors(this_compound: md_harmonize.compound.Compound) → list[source]¶

Generate the atom neighbors represented by KEGG atom type for each atom in the compound. This is used to find the center atom. We use KEGG atom types because the rclass definitions describe atoms by KEGG atom type.

Parameters: this_compound – the compound entity.
Returns: the list of atoms with their neighbors.

static find_target_atom(atoms: list, target: tuple) → list[source]¶

Find the target atoms from a list of candidate atoms.

Parameters

atoms – a list of atoms to search from.
target – the target atom to be searched.

Returns

the list of atom numbers that match the target atom.

static create_reaction_centers(i: int, kat: str, difference: list, the_other_difference: list, match: list, the_other_match: list) → collections.namedtuple[source]¶

Create the center atom based on its connected atoms and its counterpart atom in the other compound.

Parameters

i – the ith rclass definition.
kat – the KEGG atom type of the center atom.
difference – the list of KEGG atom type of different connected atoms.
the_other_difference – the list of KEGG atom type of different connected atoms in the other compound.
match – the list of KEGG atom type of the matched connected atoms.
the_other_match – the list of KEGG atom type of the matched connected atoms in the other compound.

Returns

the constructed reaction center.

find_center_atoms() → tuple[source]¶

Example of rclass definition:

C8x-C8y:*-C1c:N5y+S2x-N5y+S2x

The RDM pattern is defined as KEGG atom type changes at the reaction center (R), the difference region (D), and the matched region (M) for each reactant pair. It characterizes chemical structure transformation patterns associated with enzymatic reactions.

Returns: the list of reaction centers and the corresponding candidate atoms.
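
A sketch of how the example definition could be split into its R, D and M parts for each compound. It assumes that ":" separates the three regions, "-" separates the two compounds, "+" joins several atoms and "*" stands for no atom. The helper split_rdm is hypothetical.

```python
def split_rdm(definition):
    """Split one RDM pattern into per-compound R, D and M atom types (hypothetical helper)."""

    def atoms(part):
        # "*" is assumed to mean that no atom is present in this region.
        return [] if part == "*" else part.split("+")

    center, difference, match = definition.split(":")
    sides = []
    for side in (0, 1):
        sides.append({
            "R": center.split("-")[side],
            "D": atoms(difference.split("-")[side]),
            "M": atoms(match.split("-")[side]),
        })
    return sides


left_rdm, right_rdm = split_rdm("C8x-C8y:*-C1c:N5y+S2x-N5y+S2x")
```

Under these assumptions the left compound has center C8x with no difference atom, and the right compound has center C8y with one difference atom of type C1c.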

static get_center_list(center_atom_index: list) → list[source]¶

Generate all the combinations of reaction centers.

Parameters: center_atom_index – list of atom index lists, one for each reaction center. eg: three reaction centers: [[0, 1, 2], [5, 6], [10, 11]].
Returns: the list of combined reaction centers. eg: [[0, 5, 10], [0, 5, 11], [0, 6, 10], [0, 6, 11], [1, 5, 10], [1, 5, 11], [1, 6, 10], [1, 6, 11], [2, 5, 10], [2, 5, 11], [2, 6, 10], [2, 6, 11]]
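
The combinations in the example are a Cartesian product of the candidate lists. A minimal sketch using itertools.product; the function name center_combinations is hypothetical and the library may build the list differently.

```python
from itertools import product


def center_combinations(center_atom_index):
    """Return every way to pick one candidate atom per reaction center."""
    return [list(combo) for combo in product(*center_atom_index)]


combos = center_combinations([[0, 1, 2], [5, 6], [10, 11]])
```

With 3, 2 and 2 candidates this gives 3 * 2 * 2 = 12 combinations, in the same order as listed above.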

static remove_different_bonds(this_compound: md_harmonize.compound.Compound, center_atom_numbers: list, reaction_centers: list) → list[source]¶

Remove the bonds connecting to different atoms. For each reaction center, multiple atoms can be the different atoms. We need to get all the combinations.

Parameters

this_compound – the Compound entity.
center_atom_numbers – the list of atom numbers for center atom in the compound.
reaction_centers – the list of reaction center descriptions for the compound.

Returns

the list of bonds (represented by the atom numbers in the bond) that need to be removed based on the RDM descriptions.

generate_atom_mappings() → list[source]¶

Generate the one-to-one atom mappings of the compound pair.

Returns: the list of atom mappings.

static detect_components(this_compound: md_harmonize.compound.Compound, removed_bonds: list, center_atom_numbers: list) → list[source]¶

Detect all the components in the compound after removing some bonds. The basic idea is a breadth-first search.

Parameters

this_compound – the Compound entity.
removed_bonds – the list of removed bonds (represented by the atom numbers in the bond) in the compound.
center_atom_numbers – the list of atom numbers of the center atoms in the compound.

Returns

the list of components of the compound represented by a list of atom numbers.
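
A minimal breadth-first search sketch of the idea. It works on plain lists of atom numbers and bonds rather than on a Compound entity, and the name connected_components is hypothetical.

```python
from collections import deque


def connected_components(atom_numbers, bonds, removed_bonds):
    """Split atoms into components after ignoring the removed bonds."""
    removed = {frozenset(bond) for bond in removed_bonds}
    neighbors = {atom: set() for atom in atom_numbers}
    for first, second in bonds:
        if frozenset((first, second)) not in removed:
            neighbors[first].add(second)
            neighbors[second].add(first)

    seen, components = set(), []
    for start in atom_numbers:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # standard BFS over the remaining bonds
            atom = queue.popleft()
            component.append(atom)
            for neighbor in sorted(neighbors[atom]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# A five-atom chain 0-1-2-3-4 with the bond between atoms 1 and 2 removed.
components = connected_components(
    [0, 1, 2, 3, 4], [(0, 1), (1, 2), (2, 3), (3, 4)], [(2, 1)]
)
```

Bonds are compared as unordered pairs, so (2, 1) removes the bond stored as (1, 2).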

static pair_components(left_components: list, right_components: list) → list[source]¶

The two compounds are divided into separate components due to the difference atoms. We need to pair each component in one compound to its counterpart component in the other compound. Here we roughly pair the components based on the number of atoms in the component. Therefore, every component in one compound can be paired with several components in the other compound.

Parameters

left_components – the components in one compound.
right_components – the components in the other compound.

Returns

the list of paired components.
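
A sketch of size-based pairing, assuming components are given as lists of atom numbers. The helper pair_by_size is hypothetical; it only illustrates why one component can end up with several partners.

```python
def pair_by_size(left_components, right_components):
    """Pair every left component with each right component of equal size."""
    pairs = []
    for left in left_components:
        for right in right_components:
            if len(left) == len(right):
                pairs.append((left, right))
    return pairs


pairs = pair_by_size([[0, 1], [2, 3, 4]], [[5, 6], [7, 8], [9, 10, 11]])
```

Here the two-atom component [0, 1] is paired with both [5, 6] and [7, 8], so a later step has to decide between them.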

static construct_component(this_compound: md_harmonize.compound.Compound, atom_numbers: list, removed_bonds: list) → md_harmonize.compound.Compound[source]¶

Construct a Compound entity for the component based on the atom index and removed bonds, facilitating the following atom mappings.

Parameters

this_compound – the Compound entity.
atom_numbers – the list of atom numbers in the component.
removed_bonds – the list of removed bonds (represented by the atom numbers in the bond) in the compound.

Returns

the constructed component compound.

static preliminary_atom_mappings_check(left_component: md_harmonize.compound.Compound, right_component: md_harmonize.compound.Compound) → bool[source]¶

Roughly evaluate whether the atoms in the two components can be mapped. We check whether every atom color in the left component has its counterpart in the right component. Here, we only consider the backbone of the structure.

Parameters

left_component – the component in one compound.
right_component – the component in the other compound.

Returns

bool whether the atoms in the two components can be mapped.
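
One way to express such a check is a multiset comparison of atom colors. The sketch below assumes colors are available as plain strings and treats "has its counterpart" as multiset inclusion; both points are assumptions, and colors_can_match is a hypothetical name.

```python
from collections import Counter


def colors_can_match(left_colors, right_colors):
    """True if every left color, counted with multiplicity, also occurs on the right."""
    missing = Counter(left_colors) - Counter(right_colors)  # keeps positive counts only
    return not missing


same = colors_can_match(["C1", "C1", "O2"], ["O2", "C1", "C1"])
different = colors_can_match(["C1", "C1"], ["C1", "O2"])
```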

map_components(left_removed_bonds: list, right_removed_bonds: list, left_centers: list, right_centers: list) → dict[source]¶

Find optimal map for every component in the compound pair.

Parameters

left_removed_bonds – the list of removed bonds in one compound.
right_removed_bonds – the list of removed bonds in the other compound.
left_centers – the list of atom numbers of the center atoms in one compound.
right_centers – the list of atom numbers of the center atoms in the other compound.

Returns

the atom mappings for the compound pair based on the removed bonds and center atoms.

combine_atom_mappings(atom_mappings: list) → dict[source]¶

Combine the atom mappings of all the components. As mentioned in the pair_components function, every component can have several mappings. Here, we choose the optimal mapping with the lowest count of changed local atom identifiers, and make sure that each atom is mapped only once.

Parameters: atom_mappings – the list of atom mappings for all the components.
Returns: the atom mappings for the compound pair.

static validate_component_atom_mappings(left_centers: list, right_centers: list, component_atom_mappings: dict) → bool[source]¶

Check whether the mapped atoms correspond to the mapped reaction center atoms.

Parameters

left_centers – the list of center atom indices in the left compound.
right_centers – the list of center atom indices in the right compound.
component_atom_mappings – the one to one atom mappings of one component.

Returns

bool whether the mappings are valid.

count_changed_atom_identifiers(one_to_one_mappings: dict) → int[source]¶

Count the mapped atoms with changed local atom identifier. The different atoms (D in RCLASS definitions) can cause change of local environment, which can change the atom identifier.

Parameters: one_to_one_mappings – the dictionary of atom mappings between the two compounds.
Returns: the total number of mapped atoms with different local identifiers.
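
A sketch of the counting step and of using it to choose between candidate mappings. The identifiers are made-up strings keyed by atom number, and count_changed_identifiers is a hypothetical stand-alone function, not the method documented above.

```python
def count_changed_identifiers(mapping, left_identifiers, right_identifiers):
    """Count mapped atom pairs whose local identifiers differ."""
    return sum(
        1
        for left_atom, right_atom in mapping.items()
        if left_identifiers[left_atom] != right_identifiers[right_atom]
    )


# Illustrative local identifiers; the strings carry no chemical meaning.
left_identifiers = {0: "C-OO", 1: "O-C", 2: "O-CP"}
right_identifiers = {0: "C-OO", 1: "O-CP", 2: "O-C"}

candidates = [
    {0: 0, 1: 1, 2: 2},  # atoms 1 and 2 both land on changed environments
    {0: 0, 1: 2, 2: 1},  # every atom keeps its identifier
]
best = min(
    candidates,
    key=lambda m: count_changed_identifiers(m, left_identifiers, right_identifiers),
)
```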

md_harmonize.KEGG_parser.create_compound_kcf(kcf_file: str) → Optional[md_harmonize.compound.Compound][source]¶

Construct compound entity based on the KEGG kcf file.

Parameters: kcf_file – the path to the kcf file.
Returns: the constructed compound entity.

md_harmonize.KEGG_parser.create_reactions(reaction_directory: str, atom_mappings: dict) → list[source]¶

Create KEGG Reaction entities.

Parameters

reaction_directory – the directory that stores all the reaction files.
atom_mappings – the compound pair name and its atom mappings.

Returns

the constructed Reaction entities.

md_harmonize.KEGG_parser.compound_pair_mappings(pair_component: tuple) → tuple[source]¶

Get the atom mappings between two compounds based on the rclass definitions.

Parameters: pair_component – a tuple containing the rclass_name, rclass_definitions, one_compound and the_other_compound.
Returns: the compound pair name and its atom mappings.

md_harmonize.KEGG_parser.create_atom_mappings(rclass_directory: str, compounds: dict, seconds: int = 1200) → dict[source]¶

Generate the atom mappings between compounds based on RCLASS definitions.

Parameters

rclass_directory – the directory that stores the rclass files.
compounds – a dictionary of Compound entities.
seconds – the timeout limit.

Returns

the atom mappings of compound pairs.

md_harmonize.MetaCyc_parser¶

This module provides functions to parse MetaCyc text data.

Note: All MetaCyc reaction atom mappings are stored in a single text file.

md_harmonize.MetaCyc_parser.reaction_side_parser(reaction_side: str) → dict[source]¶

This is to parse FROM_SIDE or TO_SIDE in the reaction.

eg: FROM-SIDE - (CPD-9147 0 8) (OXYGEN-MOLECULE 9 10)

Information includes the compound name and the start and end atom indices in this compound used for atom mappings. The atoms are ordered as in the compound molfile.

Parameters: reaction_side – the text description of reaction side.
Returns: the dictionary of compounds and the corresponding start and end atom index in the atom mappings.
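
A sketch of extracting the compound names and index ranges with a regular expression. The pattern and the name parse_reaction_side are assumptions for illustration. A compound that appears twice on one side would overwrite its earlier entry in this simple version.

```python
import re

# One "(NAME start end)" group; names are assumed to contain no spaces or parentheses.
SIDE_PATTERN = re.compile(r"\(([^\s()]+) (\d+) (\d+)\)")


def parse_reaction_side(reaction_side):
    """Map each compound name to its (start, end) atom indices."""
    return {
        name: (int(start), int(end))
        for name, start, end in SIDE_PATTERN.findall(reaction_side)
    }


from_side = parse_reaction_side("FROM-SIDE - (CPD-9147 0 8) (OXYGEN-MOLECULE 9 10)")
```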

md_harmonize.MetaCyc_parser.generate_one_to_one_mappings(from_side: dict, to_side: dict, indices: str) → list[source]¶

To generate the one-to-one atom mappings between the two sides of a metabolic reaction.

Parameters

from_side – the dictionary of compounds with their corresponding start and end atom indices in the from_side.
to_side – the dictionary of compounds with their corresponding start and end atom indices in the to_side.
indices – the string representation of mapped atoms.

Returns

the list of mapped atoms between the two sides (from_index, to_index).

md_harmonize.MetaCyc_parser.atom_mappings_parser(atom_mapping_text: list) → dict[source]¶

This is to parse the MetaCyc reaction with atom mappings.

eg:

REACTION - RXN-11981

NTH-ATOM-MAPPING - 1

MAPPING-TYPE - NO-HYDROGEN-ENCODING

FROM-SIDE - (CPD-12950 0 23) (WATER 24 24)

TO-SIDE - (CPD-12949 0 24)

INDICES - 0 1 2 3 5 4 7 6 9 10 11 13 12 14 15 16 17 8 18 19 21 20 22 24 23

note: the INDICES are atom mappings between two sides of the reaction. TO-SIDE[i] is mapped to FROM-SIDE[idx] for i, idx in enumerate(INDICES). Pay attention to the direction!

Parameters: atom_mapping_text – the text descriptions of reactions with atom mappings.
Returns: the dictionary of reactions with atom mappings.
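
The direction of INDICES is easy to get wrong, so here is a sketch that applies the rule from the note to the example above. The helpers expand_side and one_to_one are hypothetical, and atom positions are reported as (compound, index within that compound).

```python
def expand_side(side):
    """List (compound, local atom index) for every global position on one side."""
    positions = []
    for name, (start, end) in sorted(side.items(), key=lambda item: item[1][0]):
        positions.extend((name, offset) for offset in range(end - start + 1))
    return positions


def one_to_one(from_side, to_side, indices):
    """Pair atoms as (from_atom, to_atom): TO-SIDE[i] maps to FROM-SIDE[idx]."""
    from_positions = expand_side(from_side)
    to_positions = expand_side(to_side)
    return [
        (from_positions[int(idx)], to_positions[i])
        for i, idx in enumerate(indices.split())
    ]


mapped = one_to_one(
    {"CPD-12950": (0, 23), "WATER": (24, 24)},
    {"CPD-12949": (0, 24)},
    "0 1 2 3 5 4 7 6 9 10 11 13 12 14 15 16 17 8 18 19 21 20 22 24 23",
)
```

In this example position 23 on the TO-SIDE carries index 24, so the single WATER atom maps to atom 23 of CPD-12949.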

md_harmonize.MetaCyc_parser.reaction_parser(reaction_text: list) → dict[source]¶

This is used to parse MetaCyc reaction.

eg:

UNIQUE-ID - RXN-13583

TYPES - Redox-Half-Reactions

ATOM-MAPPINGS - (:NO-HYDROGEN-ENCODING (1 0 2) (((WATER 0 0) (HYDROXYLAMINE 1 2)) ((NITRITE 0 2))))

CREDITS - SRI

CREDITS - caspi

IN-PATHWAY - HAONITRO-RXN

LEFT - NITRITE

^COMPARTMENT - CCO-IN

LEFT - PROTON

^COEFFICIENT - 5

^COMPARTMENT - CCO-IN

LEFT - E-

^COEFFICIENT - 4

ORPHAN? - :NO

PHYSIOLOGICALLY-RELEVANT? - T

REACTION-BALANCE-STATUS - :BALANCED

REACTION-DIRECTION - LEFT-TO-RIGHT

RIGHT - HYDROXYLAMINE

^COMPARTMENT - CCO-IN

RIGHT - WATER

^COMPARTMENT - CCO-IN

STD-REDUCTION-POTENTIAL - 0.1

//

Parameters: reaction_text – the text descriptions of MetaCyc reactions.
Returns: the dict of parsed MetaCyc reactions.

md_harmonize.MetaCyc_parser.create_reactions(reaction_file: str, atom_mapping_file: str) → list[source]¶

To create MetaCyc Reaction entities.

Parameters

reaction_file – the path to the reaction file.
atom_mapping_file – the path to the atom mapping file.

Returns

the list of constructed Reaction entities.

md_harmonize.aromatics¶

This module provides the AromaticManager class entity.

class md_harmonize.aromatics.AromaticManager(aromatic_substructures: list = None)[source]¶

Two major functions are implemented in AromaticManager.

1) Extract aromatic substructures based on labelled aromatic atoms (mainly C and N) or Indigo-detected aromatic bonds. The first case only applies to KEGG compounds, and the second case applies to compounds from any database.

2) Detect the aromatic substructures in any given compound, and update the bond type of the detected aromatic bonds.

AromaticManager initializer.

Parameters: aromatic_substructures – a list of aromatic substructures.

__init__(aromatic_substructures: list = None) → None[source]¶

AromaticManager initializer.

Parameters: aromatic_substructures – a list of aromatic substructures.

encode() → list[source]¶

To encode the aromatic substructures in the aromatic manager. (Trying to jsonpickle the AromaticManager directly raises an error because the cythonized entities cannot be pickled.)

Returns: the list of aromatic substructures.

static decode(aromatic_structures: list)[source]¶

Construct the AromaticManager based on the aromatic substructures.

Parameters: aromatic_structures – the list of aromatic substructures.
Returns: the constructed AromaticManager.

add_aromatic_substructures(substructures: list) → None[source]¶

Add newly detected aromatic structures to the manager, making sure there are no duplicates among the aromatic substructures.

Parameters: substructures – a list of aromatic substructures.
Returns: None.

kegg_aromatize(kcf_cpd: md_harmonize.compound.Compound) → None[source]¶

Extract aromatic substructures based on KEGG atom type in KEGG compound parsed from KCF file, and add the newly detected aromatic substructures to the AromaticManager.

Parameters: kcf_cpd – the KEGG compound entity derived from KCF file.
Returns: None.

indigo_aromatize(molfile: str) → None[source]¶

Extract aromatic substructures via Indigo, and add the newly detected aromatic substructures to the AromaticManager.

Parameters: molfile – the path to the molfile.
Returns: None.

indigo_aromatic_bonds(molfile: str) → set[source]¶

Detect the aromatic bonds in the compound via Indigo method.

Parameters: molfile – the path to the molfile.
Returns: the set of aromatic bonds represented by first_atom_number and second_atom_number in the bond.

static fuse_cycles(cycles: list) → list[source]¶

To fuse the cycles with shared atoms.

Parameters: cycles – the list of cycles represented by atom numbers.
Returns: the list of cleaned cycles.
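
A sketch of merging cycles that share atoms, including transitive overlaps. It assumes that sharing any single atom is enough to fuse two cycles; the exact criterion used by the library is not stated here. The name merge_overlapping_cycles is hypothetical.

```python
def merge_overlapping_cycles(cycles):
    """Merge cycles (lists of atom numbers) that have at least one atom in common."""
    fused = []
    for cycle in cycles:
        merged = set(cycle)
        remaining = []
        for group in fused:
            if group & merged:  # overlap: absorb the existing group
                merged |= group
            else:
                remaining.append(group)
        remaining.append(merged)
        fused = remaining
    return [sorted(group) for group in fused]


# Two six-membered rings sharing atoms 4 and 5, plus one separate ring.
fused_cycles = merge_overlapping_cycles(
    [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9], [20, 21, 22]]
)
```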

detect_aromatic_substructures_timeout(cpd: md_harmonize.compound.Compound) → None[source]¶

Detect the aromatic substructures in the compound and stop the search on timeout.

Parameters: cpd – the Compound entity.
Returns: None.

detect_aromatic_substructures(cpd: md_harmonize.compound.Compound) → None[source]¶

Detect all the aromatic substructures in the cpd, and update the bond type of aromatic bonds.

Parameters: cpd – the Compound entity.
Returns: None.

static construct_aromatic_entity(cpd: md_harmonize.compound.Compound, aromatic_cycles: list) → list[source]¶

Construct the aromatic substructure entity based on the aromatic atoms. Here, we also include outside atoms that are connected to aromatic rings with double bonds.

Parameters

cpd – the Compound entity.
aromatic_cycles – the list of aromatic cycles represented by atom numbers in the compound.

Returns

the list of constructed aromatic substructures.

extract_aromatic_substructures(cpd: md_harmonize.compound.Compound) → list[source]¶

Detect the aromatic substructures in a compound based on the aromatic atoms. This only applies to KEGG kcf file.

Parameters: cpd – the Compound entity.
Returns: the list of aromatic cycles represented by atom numbers.

md_harmonize.harmonization¶

This module provides the HarmonizedEdge class, the HarmonizedCompoundEdge class, and the HarmonizedReactionEdge class.

class md_harmonize.harmonization.HarmonizedEdge(one_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], other_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], relationship: int, edge_type: Union[str, int], mappings: dict)[source]¶

The HarmonizedEdge representing compound or reaction pairs.

HarmonizedEdge initializer.

Parameters

one_side – one side of the edge. This can be compound or reaction.
other_side – the other side of the edge. This can be compound or reaction.
relationship – equivalent, generic-specific, or loose.
edge_type – for compound edge, this represents resonance, linear-circular, r group, same structure; for reaction edge, this represents 3 level match or 4 level match.
mappings – for compound edge, the mappings refer to mapped atoms between compounds; for reaction edge, the mappings refer to mapped compounds between reaction.

__init__(one_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], other_side: Union[md_harmonize.compound.Compound, md_harmonize.reaction.Reaction], relationship: int, edge_type: Union[str, int], mappings: dict) → None[source]¶

HarmonizedEdge initializer.

Parameters

one_side – one side of the edge. This can be compound or reaction.
other_side – the other side of the edge. This can be compound or reaction.
relationship – equivalent, generic-specific, or loose.
edge_type – for compound edge, this represents resonance, linear-circular, r group, same structure; for reaction edge, this represents 3 level match or 4 level match.
mappings – for compound edge, the mappings refer to mapped atoms between compounds; for reaction edge, the mappings refer to mapped compounds between reaction.

property reversed_relationship¶

Get the relationship between the other side and one side.

Returns: the reversed relationship.

pair_relationship(name: str) → int[source]¶

When we map compounds in the reaction, we can access the compound edge from either side.

Parameters: name – the name of the searched one side.
Returns: the relationship of the searched pair.

class md_harmonize.harmonization.HarmonizedCompoundEdge(one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound, relationship: int, edge_type: str, atom_mappings: dict)[source]¶

The HarmonizedCompoundEdge representing compound pairs.

HarmonizedCompoundEdge initializer.

Parameters

one_compound – one Compound entity in the compound pair.
other_compound – the other Compound entity in the compound pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two compounds.
edge_type – the edge type can be resonance, linear-circular, r group, or same structure.
atom_mappings – the atom mappings between the two compounds.

__init__(one_compound: md_harmonize.compound.Compound, other_compound: md_harmonize.compound.Compound, relationship: int, edge_type: str, atom_mappings: dict) → None[source]¶

HarmonizedCompoundEdge initializer.

Parameters

one_compound – one Compound entity in the compound pair.
other_compound – the other Compound entity in the compound pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two compounds.
edge_type – the edge type can be resonance, linear-circular, r group, or same structure.
atom_mappings – the atom mappings between the two compounds.

property reversed_mappings¶

Get the atom mappings from compound on the other side to compound on the one side.

Returns: the atom mappings from the other side compound to the one side compound.

pair_atom_mappings(name: str) → dict[source]¶

Get the atom mappings of the harmonized compound edge, where one side equals the parameter name.

Parameters: name – the compound name.
Returns: the atom mappings.

class md_harmonize.harmonization.HarmonizedReactionEdge(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction, relationship: int, edge_type: int, compound_mappings: dict)[source]¶

The HarmonizedReactionEdge to represent reaction pairs.

HarmonizedReactionEdge initializer.

Parameters

one_reaction – one Reaction entity in the reaction pair.
other_reaction – the other Reaction entity in the reaction pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two reactions.
edge_type – the reactions can be 3-level EC or 4-level EC paired.
compound_mappings – the dictionary of paired compounds in the reaction pair.

__init__(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction, relationship: int, edge_type: int, compound_mappings: dict) → None[source]¶

HarmonizedReactionEdge initializer.

Parameters

one_reaction – one Reaction entity in the reaction pair.
other_reaction – the other Reaction entity in the reaction pair.
relationship – the relationship (equivalent, generic-specific, or loose) between the two reactions.
edge_type – the reactions can be 3-level EC or 4-level EC paired.
compound_mappings – the dictionary of paired compounds in the reaction pair.

class md_harmonize.harmonization.HarmonizationManager[source]¶

The HarmonizationManager is responsible for adding, removing, or searching harmonized edges.

HarmonizationManager initializer.

__init__() → None[source]¶: HarmonizationManager initializer.

save_manager() → list[source]¶

Save all the names of harmonized edges.

Returns: the list of harmonized edges.

static create_key(name_1: str, name_2: str) → str[source]¶

Create the edge key. Each edge is represented by a unique key in the harmonized_edges dictionary.

Parameters

name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.

Returns

the key of the edge.

add_edge(edge: Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge]) → bool[source]¶

Add this edge to the harmonized edges.

Parameters: edge – the HarmonizedEdge entity.
Returns: bool whether the edge is added successfully.

remove_edge(edge: Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge]) → bool[source]¶

Remove this edge from the harmonized edges.

Parameters: edge – the HarmonizedEdge entity.
Returns: bool whether the edge is removed successfully.

search(name_1: str, name_2: str) → Union[md_harmonize.harmonization.HarmonizedCompoundEdge, md_harmonize.harmonization.HarmonizedReactionEdge, None][source]¶

Search the edge based on the names of the two sides.

Parameters

name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.

Returns

the edge if it exists, otherwise None.
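
The add, remove, and search operations described above can be pictured as thin wrappers around a dictionary keyed by the edge key. The following is a minimal sketch under that assumption; `EdgeStore`, the tuple-based `Edge` type, and the key scheme are hypothetical stand-ins, not the package's classes.

```python
from typing import Dict, Optional, Tuple

Edge = Tuple[str, str]  # hypothetical stand-in for a harmonized edge


class EdgeStore:
    """Hypothetical manager that keeps edges in a dictionary."""

    def __init__(self) -> None:
        self.harmonized_edges: Dict[str, Edge] = {}

    @staticmethod
    def create_key(name_1: str, name_2: str) -> str:
        # Assumed key scheme: sorted names joined by "@".
        return "@".join(sorted((name_1, name_2)))

    def add_edge(self, edge: Edge) -> bool:
        key = self.create_key(*edge)
        if key in self.harmonized_edges:
            return False  # already present, nothing added
        self.harmonized_edges[key] = edge
        return True

    def remove_edge(self, edge: Edge) -> bool:
        key = self.create_key(*edge)
        # pop() returns None when the key is absent, i.e. nothing was removed.
        return self.harmonized_edges.pop(key, None) is not None

    def search(self, name_1: str, name_2: str) -> Optional[Edge]:
        return self.harmonized_edges.get(self.create_key(name_1, name_2))


store = EdgeStore()
first_add = store.add_edge(("A", "B"))
second_add = store.add_edge(("B", "A"))
found = store.search("B", "A")
```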

class md_harmonize.harmonization.CompoundHarmonizationManager[source]¶

The CompoundHarmonizationManager is responsible for adding, removing or searching HarmonizedCompoundEdge.

CompoundHarmonizationManager initializer.

__init__() → None[source]¶: CompoundHarmonizationManager initializer.

static find_compound(compound_dict: list, compound_name: str) → Optional[md_harmonize.compound.Compound][source]¶

Find the Compound based on the compound name in the compound dict.

Parameters

compound_dict – a list of compound dictionaries.
compound_name – the target compound name.

Returns

the Compound if it is found, otherwise None.

static create_manager(compound_dict: list, compound_pairs: list)[source]¶

Create the CompoundHarmonizationManager based on the compound pairs.

Parameters

compound_dict – the list of compound dictionaries.
compound_pairs – the list of compound pairs.

Returns

the CompoundHarmonizationManager

add_edge(edge: md_harmonize.harmonization.HarmonizedCompoundEdge) → bool[source]¶

Add a newly detected edge to the manager and update the occurrences of each compound in the harmonized edges. The occurrences are used to calculate the Jaccard index.

Parameters: edge – the HarmonizedCompoundEdge entity.
Returns: bool whether the edge is added successfully.

remove_edge(edge: md_harmonize.harmonization.HarmonizedCompoundEdge) → bool[source]¶

Remove the edge from the manager and update the occurrences of each compound in the harmonized edges.

Parameters: edge – the HarmonizedCompoundEdge entity.
Returns: bool whether the edge is removed successfully.

has_visited(name_1: str, name_2: str) → bool[source]¶

Check if the compound pair has been visited.

Parameters

name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.

Returns

bool whether the pair has been visited.

add_invalid(name_1: str, name_2: str) → None[source]¶

Add the names of an invalid compound pair to the visited pairs.

Parameters

name_1 – the name of one side of the edge.
name_2 – the name of the other side of the edge.

Returns

None.
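
Recording rejected pairs avoids testing the same pair twice. A minimal sketch of that bookkeeping, assuming an unordered pair can be represented as a `frozenset` of the two names; the class `VisitedPairs` is hypothetical and only mirrors the two method names documented above.

```python
from typing import FrozenSet, Set


class VisitedPairs:
    """Hypothetical cache of compound pairs that were already examined."""

    def __init__(self) -> None:
        self._visited: Set[FrozenSet[str]] = set()

    def add_invalid(self, name_1: str, name_2: str) -> None:
        # A frozenset ignores order, so (a, b) and (b, a) are the same pair.
        self._visited.add(frozenset((name_1, name_2)))

    def has_visited(self, name_1: str, name_2: str) -> bool:
        return frozenset((name_1, name_2)) in self._visited


cache = VisitedPairs()
cache.add_invalid("C00031", "GLC")
seen_reversed = cache.has_visited("GLC", "C00031")
seen_other = cache.has_visited("GLC", "C00022")
```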

get_edge_list() → list[source]¶

Get the names of all the harmonized edges.

Returns: the list of names of harmonized edges.

class md_harmonize.harmonization.ReactionHarmonizationManager(compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager)[source]¶

The ReactionHarmonizationManager is responsible for adding, removing or searching HarmonizedReactionEdge.

ReactionHarmonizationManager initializer.

Parameters: compound_harmonization_manager – the CompoundHarmonizationManager entity for compound pairs management.

__init__(compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager) → None[source]¶

ReactionHarmonizationManager initializer.

Parameters: compound_harmonization_manager – the CompoundHarmonizationManager entity for compound pairs management.

static compare_ecs(one_ecs: dict, other_ecs: dict) → int[source]¶

Compare the EC numbers of two reactions.

Parameters

one_ecs – the dict of EC numbers of one reaction.
other_ecs – the dict of EC numbers of the other reaction.

Returns

the EC number level at which the two reactions can be matched.
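
To illustrate what matching at a given EC level means, the sketch below compares two collections of EC number strings and reports whether they share a full four-level number or only the first three levels. It assumes fully specified EC numbers given as plain strings; the function name `ec_match_level`, the input types, and the return value of 0 for no match are assumptions and do not reflect the dict-based signature documented above.

```python
from typing import Iterable


def ec_match_level(one_ecs: Iterable[str], other_ecs: Iterable[str]) -> int:
    """Return 4 for a full EC match, 3 for a three-level match, else 0."""
    one, other = set(one_ecs), set(other_ecs)
    if one & other:
        return 4
    # Truncate each EC number to its first three levels, e.g. "1.1.1.1" -> "1.1.1".
    one_3 = {".".join(ec.split(".")[:3]) for ec in one}
    other_3 = {".".join(ec.split(".")[:3]) for ec in other}
    return 3 if one_3 & other_3 else 0


full_match = ec_match_level(["1.1.1.1"], ["1.1.1.1", "2.7.1.1"])
partial_match = ec_match_level(["1.1.1.1"], ["1.1.1.2"])
no_match = ec_match_level(["1.1.1.1"], ["2.7.1.1"])
```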

static determine_relationship(relationships: list) → int[source]¶

Determine the relationship of the reaction pair based on the relationship of paired compounds.

Parameters: relationships – the list of relationships of the compound pairs in the two reactions.
Returns: the relationship between the two reactions.

harmonize_reaction(one_reaction: md_harmonize.reaction.Reaction, other_reaction: md_harmonize.reaction.Reaction) → None[source]¶

Test if two reactions can be harmonized.

Parameters

one_reaction – one Reaction that is involved in the reaction pair.
other_reaction – the other Reaction that is involved in the reaction pair.

Returns

None.

compound_mappings(one_compounds: list, other_compounds: list) → dict[source]¶

Get the mapped compounds in the two compound lists.

Parameters

one_compounds – one list of Compound entities.
other_compounds – the other list of Compound entities.

Returns

the dictionary of paired compounds with their relationship. The relationship will be used to determine the relationship of reaction pair.

unmapped_compounds(one_compounds: list, other_compounds: list, mappings: dict) → tuple[source]¶

Get the compounds that cannot be mapped. These unmapped compounds can lead to new compound pairs.

Parameters

one_compounds – one list of Compound entities.
other_compounds – the other list of Compound entities.
mappings – the mapped compounds between the two compound lists.

Returns

two lists of compounds that cannot be mapped.
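
Finding the unmapped compounds amounts to removing every name that takes part in a mapping from each side. The sketch below works on compound names instead of Compound entities and assumes the mappings have the shape `{name: {partner_name: relationship}}`; both the shape and the function name `split_unmapped` are hypothetical.

```python
from typing import Dict, List, Tuple


def split_unmapped(
    one_names: List[str],
    other_names: List[str],
    mappings: Dict[str, Dict[str, int]],
) -> Tuple[List[str], List[str]]:
    """Return the names on each side that take part in no mapping."""
    mapped_one = set(mappings)
    mapped_other = {partner for partners in mappings.values() for partner in partners}
    one_left = [name for name in one_names if name not in mapped_one]
    other_left = [name for name in other_names if name not in mapped_other]
    return one_left, other_left


one_left, other_left = split_unmapped(
    ["A", "B", "C"], ["x", "y"], {"A": {"x": 0}}
)
```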

match_unmapped_compounds(one_side_left: list, other_side_left: list) → None[source]¶

Match the leftover (unmapped) compounds and add the valid compound pairs to the CompoundHarmonizationManager. The invalid compound pairs are also added to the CompoundHarmonizationManager to avoid redundant matching.

Parameters

one_side_left – one list of leftover Compound entities.
other_side_left – the other list of leftover Compound entities.

Returns

None.

jaccard(one_compounds: list, other_compounds: list, mappings: dict) → float[source]¶

Calculate the Jaccard index between the two lists of compounds.

Parameters

one_compounds – one list of Compound entities.
other_compounds – the other list of Compound entities.
mappings – the dictionary of mapped compounds between the two compound lists.

Returns

the Jaccard index of the two compound lists.
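
For reference, the plain Jaccard index of two sets is the size of their intersection divided by the size of their union. When the two lists come from different databases, the intersection can be taken as the number of mapped pairs. The sketch below shows this unweighted form only; the function name `jaccard_index` and the pair-list input are assumptions, and the package may weight compounds differently (for example by their occurrences in harmonized edges).

```python
from typing import List, Tuple


def jaccard_index(
    one_names: List[str],
    other_names: List[str],
    mapped_pairs: List[Tuple[str, str]],
) -> float:
    """Unweighted Jaccard index, counting each mapped pair as one shared item."""
    intersection = len(mapped_pairs)
    # Each mapped pair is counted once in the union, not twice.
    union = len(one_names) + len(other_names) - intersection
    return intersection / union if union else 0.0


score = jaccard_index(["a", "b", "c"], ["x", "y"], [("a", "x")])
```

In this example one pair is shared out of four distinct items, so `score` is 0.25.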

one_to_one_compound_mappings(mappings: dict) → Optional[tuple][source]¶

Find the one-to-one compound mappings between the two reactions. This step handles the rare cases where a compound in one reaction can be mapped to two or more compounds in the other reaction.

Parameters: mappings – the dictionary of compound mappings.
Returns: a tuple of the relationships of the compound pairs and the dictionary of one-to-one compound mappings.
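
One way to resolve ambiguous mappings is a small backtracking search that assigns each compound a distinct partner. The sketch below is not the package's algorithm; it assumes mappings of the shape `{name: {partner_name: relationship}}`, and the function name `one_to_one` is hypothetical. It returns None when no assignment gives every compound its own partner.

```python
from typing import Dict, List, Optional, Set, Tuple


def one_to_one(
    mappings: Dict[str, Dict[str, int]],
) -> Optional[Tuple[List[int], Dict[str, str]]]:
    """Pick one distinct partner per compound, or return None if impossible."""
    names = sorted(mappings)
    assignment: Dict[str, str] = {}
    used: Set[str] = set()

    def assign(index: int) -> bool:
        if index == len(names):
            return True  # every compound has a partner
        name = names[index]
        for partner in sorted(mappings[name]):
            if partner in used:
                continue
            used.add(partner)
            assignment[name] = partner
            if assign(index + 1):
                return True
            # Undo the choice and try the next candidate partner.
            used.discard(partner)
            del assignment[name]
        return False

    if not assign(0):
        return None
    relationships = [mappings[name][assignment[name]] for name in names]
    return relationships, assignment


resolved = one_to_one({"A": {"x": 0, "y": 1}, "B": {"x": 0}})
conflict = one_to_one({"A": {"x": 0}, "B": {"x": 0}})
```

Here `A` could pair with either `x` or `y`, but `B` can only pair with `x`, so the search settles on `A` with `y` and `B` with `x`.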

md_harmonize.harmonization.harmonize_compound_list(compound_dict_list: list) → md_harmonize.harmonization.CompoundHarmonizationManager[source]¶

Harmonize compounds across different databases based on the compound coloring identifier.

Parameters: compound_dict_list – the list of Compound dictionaries from different sources.
Returns: the CompoundHarmonizationManager containing harmonized compound edges.

md_harmonize.harmonization.harmonize_reaction_list(reaction_lists: list, compound_harmonization_manager: md_harmonize.harmonization.CompoundHarmonizationManager) → md_harmonize.harmonization.ReactionHarmonizationManager[source]¶

Harmonize reactions across different sources based on the harmonized compounds. At the same time, this also harmonizes compound pairs of the resonance, linear-circular, and R group types.

Parameters

reaction_lists – a list of Reaction lists from different sources.
compound_harmonization_manager – a CompoundHarmonizationManager containing harmonized compound pairs with the same structure.

Returns

ReactionHarmonizationManager
