Tutorial

Note: Many KEGG entry IDs contain colons and kegg_pull saves KEGG entry files with their ID in the file name. When running on Windows, all file names with colons will have their colons replaced with underscores.
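The renaming in the note above can be pictured with a small sketch. The helper name `entry_file_name` and its `on_windows` parameter are illustrative assumptions and not part of kegg_pull; only the colon-to-underscore substitution on Windows comes from the note, and the `.txt` extension follows the file names shown in the CLI examples of this tutorial.

```python
import os


def entry_file_name(entry_id: str, on_windows: bool = (os.name == "nt")) -> str:
    """Hypothetical helper: build the file name an entry would be saved under."""
    # Colons are not allowed in Windows file names, so they become underscores there.
    safe_id = entry_id.replace(":", "_") if on_windows else entry_id
    return f"{safe_id}.txt"


print(entry_file_name("cpd:C00001", on_windows=False))  # cpd:C00001.txt
print(entry_file_name("cpd:C00001", on_windows=True))   # cpd_C00001.txt
```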

API

`SinglePull`

You can pull (request, then save to the file system or keep in memory) a limited number of KEGG entries using the SinglePull class with a list of entry IDs as input. The pull method stores the entries in the file system, and the pull_dict method stores them in memory in a dictionary. The number of entries is limited because SinglePull makes only one request to the KEGG REST API, and KEGG limits the number of entries that can be pulled with a single request. The output parameter of the pull method specifies where the entries are saved: a ZIP archive if the value ends in “.zip”, otherwise a directory. The returned PullResult object tells you which of the entry IDs succeeded, which failed, and which timed out. In this example, the entry ID succeeds.

`pull`

import kegg_pull.pull as p
single_pull = p.SinglePull()
entry_ids = ['br:br08902']
pull_result = single_pull.pull(entry_ids=entry_ids, output='pull-entries/')
print(pull_result)

Successful Entry Ids: br:br08902
Failed Entry Ids: none
Timed Out Entry Ids: none

In this example, the entry ID fails.

single_pull.pull(['br:br03220'], output='pull-entries/')

Successful Entry Ids: none
Failed Entry Ids: br:br03220
Timed Out Entry Ids: none

When SinglePull pulls multiple entries at a time, they are automatically separated from one another and saved in individual files.

single_pull.pull(entry_ids=['cpd:C00001', 'cpd:C00002', 'cpd:C00003'], output='pull-entries/')

Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003
Failed Entry Ids: none
Timed Out Entry Ids: none

An exception is thrown if you attempt to provide more entry IDs to the pull() method than what is accepted by KEGG.

import logging as log

try:
    single_pull.pull(entry_ids=['cpd:C00001', 'cpd:C00002', 'cpd:C00003', 'cpd:C00004', 'cpd:C00005', 'cpd:C00006',
        'cpd:C00007', 'cpd:C00008', 'cpd:C00009', 'cpd:C00010', 'cpd:C00011'], output='pull-entries/')
except ValueError as error:
    log.error(error)

ERROR:root:Cannot create URL - The maximum number of entry IDs is 10 but 11 were provided

`pull_dict`

The pull_dict method pulls entries just like the pull method except that it stores the entries in memory rather than in the file system. This is helpful if the user of the API wishes to use the entries immediately and doesn’t need them to be stored persistently. Specifically, pull_dict returns a tuple containing the PullResult as described above along with a dictionary mapping the entry ID to the entry itself.

pull_result, compounds = single_pull.pull_dict(
    entry_ids=['cpd:C00001', 'cpd:C00002', 'cpd:C00003'], entry_field='mol')

pull_result

Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003
Failed Entry Ids: none
Timed Out Entry Ids: none

The entries can be accessed from the dictionary using the provided entry IDs as keys.

print(compounds['cpd:C00001'])

  3  2  0  0  0  0  0  0  0  0999 V2000
   22.1250  -16.2017    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   23.6000  -15.2112    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   20.7129  -15.2859    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  1  3  1  0     0  0
M  END

> <ENTRY>
cpd:C00001
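Because the dictionary values are plain strings, they can be processed right away. The sketch below reads the atom and bond counts from the counts line of a mol entry like the one printed above. `mol_counts` is a hypothetical helper, not part of kegg_pull, and the sample text is copied from the cpd:C00001 output rather than fetched.

```python
def mol_counts(mol_text: str) -> tuple:
    """Hypothetical helper: return (atom count, bond count) of a V2000 mol entry."""
    for line in mol_text.splitlines():
        if line.rstrip().endswith("V2000"):
            # The counts line stores atoms in columns 1-3 and bonds in columns 4-6.
            return int(line[0:3]), int(line[3:6])
    raise ValueError("No V2000 counts line found")


# Counts line, atom block, and bond block copied from the cpd:C00001 output above.
water_mol = (
    "  3  2  0  0  0  0  0  0  0  0999 V2000\n"
    "   22.1250  -16.2017    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "   23.6000  -15.2112    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "   20.7129  -15.2859    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "  1  2  1  0     0  0\n"
    "  1  3  1  0     0  0\n"
    "M  END\n"
)

atoms, bonds = mol_counts(water_mol)
print(atoms, bonds)  # 3 2
```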

`SingleProcessMultiplePull` and `MultiProcessMultiplePull`

To get past the limit on the number of entries that can be pulled at a time, two classes can pull an arbitrary number of entries: SingleProcessMultiplePull and MultiProcessMultiplePull. MultiProcessMultiplePull will likely pull faster since it pulls within multiple processes, but it requires multiple cores. Like SinglePull, both classes have a pull method, which returns a PullResult, and a pull_dict method, which returns a tuple containing a PullResult and a dictionary.

multiple_pull = p.SingleProcessMultiplePull()

entry_ids = [
    'cpd:C00001',
    'cpd:C00002',
    'cpd:C00003',
    'cpd:C00004',
    'cpd:C00005',
    'cpd:C00006',
    'cpd:C00007',
    'cpd:C00008',
    'cpd:C00009',
    'cpd:C00010',
    'cpd:C00011'
]

multiple_pull.pull(entry_ids, output='pull-entries/')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.72it/s]

Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003, cpd:C00004, cpd:C00005, cpd:C00006, cpd:C00007, cpd:C00008, cpd:C00009, cpd:C00010, cpd:C00011
Failed Entry Ids: none
Timed Out Entry Ids: none
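To see why an arbitrary number of entries needs more than one request, it can help to picture the ID list being split into batches that each respect the per-request limit. The sketch below is only an illustration under that assumption and not necessarily how kegg_pull divides the work internally; `batches` and `MAX_IDS_PER_REQUEST` are hypothetical names, and the value 10 is taken from the error message shown earlier.

```python
from typing import Iterator, List

MAX_IDS_PER_REQUEST = 10  # limit reported in the SinglePull error message


def batches(ids: List[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# The same eleven compound IDs used in the example above.
example_ids = [f"cpd:C{number:05d}" for number in range(1, 12)]

for batch in batches(example_ids):
    print(len(batch), batch[0], batch[-1])
```

With eleven IDs this yields one batch of ten and one batch of one, each small enough for a single request.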

You can specify the number of processes to use for MultiProcessMultiplePull with the n_workers parameter, which defaults to the number of cores available.

multiple_pull = p.MultiProcessMultiplePull(n_workers=2)
multiple_pull.pull(entry_ids, output='pull-entries/')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  6.26it/s]

Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003, cpd:C00004, cpd:C00005, cpd:C00006, cpd:C00007, cpd:C00008, cpd:C00009, cpd:C00010, cpd:C00011
Failed Entry Ids: none
Timed Out Entry Ids: none

The pull_dict method is also available:

pull_result, compounds = multiple_pull.pull_dict(entry_ids, entry_field='mol')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00, 10.72it/s]

print(compounds['cpd:C00011'])

  3  2  0  0  0  0  0  0  0  0999 V2000
   21.8400  -11.9918    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.6288  -12.6940    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   23.0512  -12.6940    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0     0  0
  1  3  2  0     0  0
M  END

> <ENTRY>
cpd:C00011

$$$$

Entry IDs

The entry_ids module provides a number of different ways to pull a list of KEGG entry IDs.

import kegg_pull.entry_ids as ei
entry_ids = ei.from_database('brite')
print(entry_ids)

['br:br08901', 'br:br08902', 'br:br08904', 'br:br08906', 'br:ko00001', 'br:ko00002', 'br:ko00003', 'br:br08907', 'br:ko01000', 'br:ko01001', 'br:ko01009', 'br:ko01002', 'br:ko01003', 'br:ko01005', 'br:ko01011', 'br:ko01004', 'br:ko01008', 'br:ko01006', 'br:ko01007', 'br:ko00199', 'br:ko00194', 'br:ko03000', 'br:ko03021', 'br:ko03019', 'br:ko03041', 'br:ko03011', 'br:ko03009', 'br:ko03016', 'br:ko03012', 'br:ko03110', 'br:ko04131', 'br:ko04121', 'br:ko03051', 'br:ko03032', 'br:ko03036', 'br:ko03400', 'br:ko03029', 'br:ko02000', 'br:ko02044', 'br:ko02042', 'br:ko02022', 'br:ko02035', 'br:ko03037', 'br:ko04812', 'br:ko04147', 'br:ko02048', 'br:ko04030', 'br:ko04050', 'br:ko04054', 'br:ko03310', 'br:ko04040', 'br:ko04031', 'br:ko04052', 'br:ko04515', 'br:ko04090', 'br:ko01504', 'br:ko00535', 'br:ko00536', 'br:ko00537', 'br:ko04091', 'br:ko04990', 'br:ko03200', 'br:ko03210', 'br:ko03100', 'br:br08001', 'br:br08002', 'br:br08003', 'br:br08005', 'br:br08006', 'br:br08007', 'br:br08009', 'br:br08021', 'br:br08120', 'br:br08201', 'br:br08202', 'br:br08204', 'br:br08203', 'br:br08303', 'br:br08302', 'br:br08301', 'br:br08313', 'br:br08312', 'br:br08304', 'br:br08305', 'br:br08331', 'br:br08330', 'br:br08332', 'br:br08310', 'br:br08307', 'br:br08327', 'br:br08311', 'br:br08402', 'br:br08401', 'br:br08403', 'br:br08411', 'br:br08410', 'br:br08420', 'br:br08601', 'br:br08610', 'br:br08611', 'br:br08612', 'br:br08613', 'br:br08614', 'br:br08615', 'br:br08620', 'br:br08621', 'br:br08605', 'br:br03220', 'br:br03222', 'br:br03223', 'br:br01610', 'br:br01611', 'br:br01612', 'br:br01613', 'br:br01601', 'br:br01602', 'br:br01600', 'br:br01620', 'br:br01553', 'br:br01554', 'br:br01556', 'br:br01555', 'br:br01557', 'br:br01800', 'br:br01810', 'br:br08011', 'br:br08020', 'br:br08012', 'br:br08319', 'br:br08329', 'br:br08318', 'br:br08328', 'br:br08309', 'br:br08341', 'br:br08324', 'br:br08317', 'br:br08315', 'br:br08314', 'br:br08442', 'br:br08441', 'br:br08431']

Entry ID Mappings

The map module converts the output of the KEGG REST API “link” or “conv” operation into dictionaries usable in Python code.

“link” operation

import kegg_pull.map as kmap

pathway_to_compound = kmap.entries_link(entry_ids=['path:map00010', 'path:map00020'], target_database='compound')
print(pathway_to_compound)

{'path:map00010': {'cpd:C00186', 'cpd:C00033', 'cpd:C00036', 'cpd:C00668', 'cpd:C06188', 'cpd:C01172', 'cpd:C05378', 'cpd:C00118', 'cpd:C06186', 'cpd:C00084', 'cpd:C05125', 'cpd:C00074', 'cpd:C00267', 'cpd:C00024', 'cpd:C00631', 'cpd:C00469', 'cpd:C00236', 'cpd:C00103', 'cpd:C00197', 'cpd:C00221', 'cpd:C00068', 'cpd:C15972', 'cpd:C00031', 'cpd:C00111', 'cpd:C01159', 'cpd:C16255', 'cpd:C15973', 'cpd:C05345', 'cpd:C00022', 'cpd:C06187', 'cpd:C01451'}, 'path:map00020': {'cpd:C00036', 'cpd:C00091', 'cpd:C00042', 'cpd:C00122', 'cpd:C05125', 'cpd:C00074', 'cpd:C00024', 'cpd:C00026', 'cpd:C16254', 'cpd:C00068', 'cpd:C15972', 'cpd:C00158', 'cpd:C00149', 'cpd:C00417', 'cpd:C16255', 'cpd:C15973', 'cpd:C05381', 'cpd:C00311', 'cpd:C05379', 'cpd:C00022'}}
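Since each value in the mapping is a set, ordinary set operations apply. As a brief sketch, the compounds shared by both pathways can be found with an intersection. The dictionary below is a hand-copied subset of the output above, named `subset` to make clear it is not the full result.

```python
# Hand-copied subset of the mapping printed above (most compounds omitted).
subset = {
    "path:map00010": {"cpd:C00022", "cpd:C00024", "cpd:C00031", "cpd:C00036"},
    "path:map00020": {"cpd:C00022", "cpd:C00024", "cpd:C00036", "cpd:C00042"},
}

# Compounds linked to both pathways within this subset.
shared = subset["path:map00010"] & subset["path:map00020"]
print(sorted(shared))  # ['cpd:C00022', 'cpd:C00024', 'cpd:C00036']
```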

“conv” operation

kegg_to_pubchem = kmap.entries_conv(entry_ids=['cpd:C00001', 'cpd:C00002'], target_database='pubchem')
print(kegg_to_pubchem)

{'cpd:C00001': {'pubchem:3303'}, 'cpd:C00002': {'pubchem:3304'}}

Pathway Organizer

The pathway_organizer module flattens a brite hierarchy into a mapping of the IDs of its nodes to information about those nodes.

import kegg_pull.pathway_organizer as po
pathway_org = po.PathwayOrganizer.load_from_kegg()
print(pathway_org.hierarchy_nodes['Metabolism'])

{'name': 'Metabolism', 'level': 1, 'parent': None, 'children': ['Amino acid metabolism', 'Biosynthesis of other secondary metabolites', 'Carbohydrate metabolism', 'Chemical structure transformation maps', 'Energy metabolism', 'Global and overview maps', 'Glycan biosynthesis and metabolism', 'Lipid metabolism', 'Metabolism of cofactors and vitamins', 'Metabolism of other amino acids', 'Metabolism of terpenoids and polyketides', 'Nucleotide metabolism', 'Xenobiotics biodegradation and metabolism'], 'entry_id': None}
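Because every node records its parent and children, the flattened mapping can still be walked like a tree. The sketch below collects the entry IDs of all leaf nodes under a given node. The `sample_nodes` dictionary is a small illustrative stand-in shaped like the nodes shown in this tutorial, with abbreviated child lists; it assumes that children are listed by their key in the mapping, and `leaf_entry_ids` is a hypothetical helper.

```python
# Illustrative stand-in for pathway_org.hierarchy_nodes (child lists abbreviated).
# Assumption: each child is referenced by its key in the mapping.
sample_nodes = {
    "Metabolism": {"name": "Metabolism", "level": 1, "parent": None,
                   "children": ["Energy metabolism"], "entry_id": None},
    "Energy metabolism": {"name": "Energy metabolism", "level": 2, "parent": "Metabolism",
                          "children": ["path:map00190", "path:map00195"], "entry_id": None},
    "path:map00190": {"name": "00190  Oxidative phosphorylation", "level": 3,
                      "parent": "Energy metabolism", "children": None,
                      "entry_id": "path:map00190"},
    "path:map00195": {"name": "00195  Photosynthesis", "level": 3,
                      "parent": "Energy metabolism", "children": None,
                      "entry_id": "path:map00195"},
}


def leaf_entry_ids(nodes: dict, key: str) -> list:
    """Hypothetical helper: depth-first collection of leaf entry IDs below `key`."""
    node = nodes[key]
    if not node["children"]:
        return [node["entry_id"]]
    ids = []
    for child in node["children"]:
        ids.extend(leaf_entry_ids(nodes, child))
    return ids


print(leaf_entry_ids(sample_nodes, "Metabolism"))
```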

Rest API

The KEGGrest class provides low-level wrapper methods for the KEGG REST API, including all of its operations. The resulting KEGGresponse object contains both the text and binary versions of the response body, the status of the response (one of SUCCESS, FAILED, or TIMEOUT), and the internal URL used to request from the KEGG REST API. For most use cases, the higher-level kegg_pull functionality will be preferred over the lower-level KEGGrest class. In fact, most of kegg_pull’s higher-level functionality relies on the KEGGrest class but provides a more useful programming interface.

import kegg_pull.rest as r
kegg_rest = r.KEGGrest()
kegg_response = kegg_rest.info(database='module')

kegg_response.status

<Status.SUCCESS: 1>

kegg_response.text_body

'module           KEGG Module Database\nmd               Release 106.0+/04-07, Apr 23\n                 Kanehisa Laboratories\n                 550 entries\n\nlinked db        pathway\n                 ko\n                 <org>\n                 genome\n                 compound\n                 glycan\n                 reaction\n                 enzyme\n                 disease\n                 pubmed\n'

kegg_response.kegg_url

https://rest.kegg.jp/info/module
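The URL above follows the general pattern of the KEGG REST API, where the operation name and its arguments are path segments. As a rough sketch of how a URL for the “get” operation could be assembled, the function below joins entry IDs with `+` and appends an optional entry field. `build_get_url` is a hypothetical helper and not kegg_pull's internal code; the limit of 10 is taken from the SinglePull error message shown earlier.

```python
from typing import List, Optional

BASE_URL = "https://rest.kegg.jp"
MAX_ENTRY_IDS = 10  # limit reported in the SinglePull error message


def build_get_url(entry_ids: List[str], entry_field: Optional[str] = None) -> str:
    """Hypothetical helper: assemble a KEGG REST 'get' URL for a few entries."""
    if len(entry_ids) > MAX_ENTRY_IDS:
        raise ValueError(
            f"The maximum number of entry IDs is {MAX_ENTRY_IDS} "
            f"but {len(entry_ids)} were provided"
        )
    # Multiple entries are joined with '+' in a single path segment.
    url = f"{BASE_URL}/get/{'+'.join(entry_ids)}"
    # An entry field such as 'mol' is appended as a further path segment.
    return f"{url}/{entry_field}" if entry_field else url


print(build_get_url(["cpd:C00001", "cpd:C00002"], entry_field="mol"))
# https://rest.kegg.jp/get/cpd:C00001+cpd:C00002/mol
```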

CLI

The command line interface has 5 subcommands: pull, entry-ids, map, pathway-organizer, and rest. They are analogous to the API modules and methods.

pull

From a user-specified list of entry IDs

% kegg_pull pull entry-ids cpd:C00001,cpd:C00002,cpd:C00003 --output=compound-entries/

100%|█████████████████████████████████████████████| 3/3 [00:01<00:00,  1.99it/s]

% head compound-entries/cpd:C00001.txt

ENTRY       C00001                      Compound
NAME        H2O;
            Water
FORMULA     H2O
EXACT_MASS  18.0106
MOL_WEIGHT  18.0153
REMARK      Same as: D00001
REACTION    R00001 R00002 R00004 R00005 R00009 R00010 R00011 R00017
            R00022 R00024 R00025 R00026 R00028 R00036 R00041 R00044
            R00045 R00047 R00048 R00052 R00053 R00054 R00055 R00056

The pull subcommand creates a pull-results.json file. You can load it as a dictionary using the Python json library.

import json as j

with open('pull-results.json', 'r') as file:
    pull_results = j.load(file)

print(pull_results)

{'percent-success': 100.0, 'pull-minutes': 0.03, 'num-successful': 3, 'num-failed': 0, 'num-timed-out': 0, 'num-total': 3, 'successful-entry-ids': ['cpd:C00001', 'cpd:C00002', 'cpd:C00003'], 'failed-entry-ids': [], 'timed-out-entry-ids': []}

Below is what the pull-results.json file contents look like:

% cat pull-results.json

{
"percent-success": 100.0,
"pull-minutes": 0.03,
"num-successful": 3,
"num-failed": 0,
"num-timed-out": 0,
"num-total": 3,
"successful-entry-ids": [
"cpd:C00001",
"cpd:C00002",
"cpd:C00003"
],
"failed-entry-ids": [],
"timed-out-entry-ids": []
}
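The summary fields in this file are related to one another, which makes a quick sanity check possible. The sketch below parses the JSON shown above and recomputes the success percentage from the counts. `recomputed_percent_success` is a hypothetical helper, and rounding to two decimals is an assumption based on the values that appear in this tutorial.

```python
import json

# The pull-results.json contents shown above, embedded so the sketch is self-contained.
pull_results_text = """
{
"percent-success": 100.0,
"pull-minutes": 0.03,
"num-successful": 3,
"num-failed": 0,
"num-timed-out": 0,
"num-total": 3,
"successful-entry-ids": ["cpd:C00001", "cpd:C00002", "cpd:C00003"],
"failed-entry-ids": [],
"timed-out-entry-ids": []
}
"""
pull_results = json.loads(pull_results_text)


def recomputed_percent_success(results: dict) -> float:
    """Hypothetical helper: success percentage derived from the counts."""
    return round(100 * results["num-successful"] / results["num-total"], 2)


# The three outcome counts should sum to the total number of entries.
counts_add_up = (
    pull_results["num-successful"]
    + pull_results["num-failed"]
    + pull_results["num-timed-out"]
    == pull_results["num-total"]
)
print(counts_add_up, recomputed_percent_success(pull_results))
```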

Entry IDs can also be passed in from standard input when the <entry-ids> argument is - rather than a comma-separated list. This example saves the entries to a ZIP archive.

standard_input = """
cpd:C00001
cpd:C00002
cpd:C00003
"""

with open('standard_input.txt', 'w') as file:
    file.write(standard_input)

% cat standard_input.txt | kegg_pull pull entry-ids - --output=compound-entries.zip

100%|█████████████████████████████████████████████| 3/3 [00:01<00:00,  2.04it/s]
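Entries saved to a ZIP archive can be read back with Python's standard zipfile module. The sketch below assumes that each archive member is named after its entry ID with a `.txt` suffix, mirroring the file names in the directory examples; that naming is an assumption, so check the member names in your own archive. `read_entries` is a hypothetical helper, and a stand-in archive is built in memory so the sketch runs without a prior pull.

```python
import io
import zipfile


def read_entries(archive) -> dict:
    """Hypothetical helper: map each member name (minus an assumed .txt suffix) to its text."""
    entries = {}
    with zipfile.ZipFile(archive) as zip_file:
        for name in zip_file.namelist():
            entry_id = name[:-len(".txt")] if name.endswith(".txt") else name
            entries[entry_id] = zip_file.read(name).decode()
    return entries


# Stand-in for compound-entries.zip, built in memory for illustration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_file:
    zip_file.writestr("cpd:C00001.txt", "ENTRY       C00001                      Compound\n")
buffer.seek(0)

entries = read_entries(buffer)
print(list(entries))  # ['cpd:C00001']
```

To read a real archive, pass its path instead of the in-memory buffer, for example `read_entries('compound-entries.zip')`.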

From a database

% kegg_pull pull database brite --multi-process --n-workers=11 --output=brite-entries/

100%|█████████████████████████████████████████| 141/141 [00:19<00:00,  7.33it/s]

% ls brite-entries/

br:br08001.txt      br:br08314.txt  br:br08610.txt  br:ko01003.txt  br:ko03036.txt
br:br08002.txt      br:br08315.txt  br:br08611.txt  br:ko01004.txt  br:ko03037.txt
br:br08003.txt      br:br08317.txt  br:br08612.txt  br:ko01005.txt  br:ko03041.txt
br:br08005.txt      br:br08318.txt  br:br08613.txt  br:ko01006.txt  br:ko03051.txt
br:br08006.txt      br:br08319.txt  br:br08614.txt  br:ko01007.txt  br:ko03100.txt
br:br08007.txt      br:br08324.txt  br:br08615.txt  br:ko01008.txt  br:ko03110.txt
br:br08009.txt      br:br08327.txt  br:br08620.txt  br:ko01009.txt  br:ko03200.txt
br:br08021.txt      br:br08328.txt  br:br08621.txt  br:ko01011.txt  br:ko03210.txt
br:br08120.txt      br:br08329.txt  br:br08901.txt  br:ko01504.txt  br:ko03310.txt
br:br08201.txt      br:br08330.txt  br:br08902.txt  br:ko02000.txt  br:ko03400.txt
br:br08202.txt      br:br08331.txt  br:br08904.txt  br:ko02022.txt  br:ko04030.txt
br:br08203.txt      br:br08332.txt  br:br08906.txt  br:ko02035.txt  br:ko04031.txt
br:br08204.txt      br:br08341.txt  br:br08907.txt  br:ko02042.txt  br:ko04040.txt
br:br08301.txt      br:br08401.txt  br:ko00001.txt  br:ko02044.txt  br:ko04050.txt
br:br08302.txt      br:br08402.txt  br:ko00002.txt  br:ko02048.txt  br:ko04052.txt
br:br08303.txt      br:br08403.txt  br:ko00003.txt  br:ko03000.txt  br:ko04054.txt
br:br08304.txt      br:br08410.txt  br:ko00194.txt  br:ko03009.txt  br:ko04090.txt
br:br08305.txt      br:br08411.txt  br:ko00199.txt  br:ko03011.txt  br:ko04091.txt
br:br08307.txt      br:br08420.txt  br:ko00535.txt  br:ko03012.txt  br:ko04121.txt
br:br08309.txt      br:br08431.txt  br:ko00536.txt  br:ko03016.txt  br:ko04131.txt
br:br08310.txt      br:br08441.txt  br:ko00537.txt  br:ko03019.txt  br:ko04147.txt
br:br08311.txt      br:br08442.txt  br:ko01000.txt  br:ko03021.txt  br:ko04515.txt
br:br08312.txt      br:br08601.txt  br:ko01001.txt  br:ko03029.txt  br:ko04812.txt
br:br08313.txt      br:br08605.txt  br:ko01002.txt  br:ko03032.txt  br:ko04990.txt

% head pull-results.json

{
"percent-success": 85.11,
"pull-minutes": 0.32,
"num-successful": 120,
"num-failed": 21,
"num-timed-out": 0,
"num-total": 141,
"successful-entry-ids": [
"br:br08901",
"br:br08902",

Printing Entries

Alternatively, you can print the KEGG entries to the screen rather than saving them in separate files.

% kegg_pull pull entry-ids C00001,C00007 --entry-field=mol --print

100%|█████████████████████████████████████████████| 2/2 [00:00<00:00,  2.56it/s]
C00001



  3  2  0  0  0  0  0  0  0  0999 V2000
   22.1250  -16.2017    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   23.6000  -15.2112    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   20.7129  -15.2859    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  1  3  1  0     0  0
M  END

> <ENTRY>
cpd:C00001



C00007




  2  1  0  0  0  0  0  0  0  0999 V2000
   24.3446  -17.0048    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   25.7446  -17.0048    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0     0  0
M  END

> <ENTRY>
cpd:C00007

entry-ids

% kegg_pull entry-ids database brite --output=brite-entry-ids.txt
% head brite-entry-ids.txt

br:br08901
br:br08902
br:br08904
br:br08906
br:ko00001
br:ko00002
br:ko00003
br:br08907
br:ko01000
br:ko01001
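A file with one entry ID per line, like the one above, is easy to load back into a Python list, for example to pass it to one of the pull classes. The sketch below uses a hypothetical helper, `read_entry_ids`, and an in-memory stand-in for the file containing the first few IDs listed above.

```python
import io


def read_entry_ids(lines) -> list:
    """Hypothetical helper: collect non-empty, stripped lines as entry IDs."""
    return [line.strip() for line in lines if line.strip()]


# Stand-in for open('brite-entry-ids.txt'), using the first IDs listed above.
sample_file = io.StringIO("br:br08901\nbr:br08902\nbr:br08904\n\n")
entry_id_list = read_entry_ids(sample_file)
print(entry_id_list)  # ['br:br08901', 'br:br08902', 'br:br08904']
```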

% kegg_pull entry-ids molec-attr drug --em=433 --em=434

dr:D00752
dr:D00892
dr:D02110
dr:D02114
dr:D02238
dr:D03088
dr:D04789
dr:D05806
dr:D05911
dr:D06342
dr:D07084
dr:D07761
dr:D07879
dr:D08757
dr:D09567
dr:D10084
dr:D10309
dr:D10661
dr:D11316

% kegg_pull entry-ids keywords compound protein,enzyme

cpd:C05197
cpd:C05312
cpd:C15972
cpd:C15973

map

% kegg_pull map link entry-ids path:map00010,path:map00020 compound --output=mapping.json
% head mapping.json

{
  "path:map00010": [
    "cpd:C00022",
    "cpd:C00024",
    "cpd:C00031",
    "cpd:C00033",
    "cpd:C00036",
    "cpd:C00068",
    "cpd:C00074",
    "cpd:C00084",

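In the JSON file the linked entries are stored as lists rather than sets. After loading the file you may want the reverse direction as well, from each compound to the pathways that contain it. The sketch below does this for a small hand-copied subset of the mapping; `invert_mapping` is a hypothetical helper, not a kegg_pull function.

```python
import json

# Hand-copied subset shaped like mapping.json (most compounds omitted).
mapping_text = """
{
  "path:map00010": ["cpd:C00022", "cpd:C00024", "cpd:C00031"],
  "path:map00020": ["cpd:C00022", "cpd:C00024", "cpd:C00026"]
}
"""
mapping = json.loads(mapping_text)


def invert_mapping(forward: dict) -> dict:
    """Hypothetical helper: map each target ID to the sorted source IDs linking to it."""
    inverse = {}
    for source_id, target_ids in forward.items():
        for target_id in target_ids:
            inverse.setdefault(target_id, []).append(source_id)
    return {target_id: sorted(sources) for target_id, sources in inverse.items()}


compound_to_pathway = invert_mapping(mapping)
print(compound_to_pathway["cpd:C00022"])  # ['path:map00010', 'path:map00020']
```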
pathway-organizer

% kegg_pull pathway-organizer --tln=Metabolism --fn="Global and overview maps,Carbohydrate metabolism" --output=hierarchy-nodes.json
% head hierarchy-nodes.json

{
  "path:map00190": {
    "name": "00190  Oxidative phosphorylation",
    "level": 3,
    "parent": "Energy metabolism",
    "children": null,
    "entry_id": "path:map00190"
  },
  "path:map00195": {
    "name": "00195  Photosynthesis",

rest

% kegg_pull rest info enzyme

enzyme           KEGG Enzyme Database
ec               Release 106.0+/04-07, Apr 23
                 Kanehisa Laboratories
                 8,056 entries

linked db        pathway
                 module
                 ko
                 <org>
                 vg
                 vp
                 ag
                 compound
                 glycan
                 reaction
                 rclass

% kegg_pull rest get cpd:C00007 --entry-field=mol

  2  1  0  0  0  0  0  0  0  0999 V2000
   24.3446  -17.0048    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   25.7446  -17.0048    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0     0  0
M  END

> <ENTRY>
cpd:C00007

$$$$

% kegg_pull rest conv entry-ids gl:G13143,gl:G13141,gl:G13139 pubchem

gl:G13143   pubchem:405226698
gl:G13141   pubchem:405226697
gl:G13139   pubchem:405226696

The rest subcommand additionally offers the --test option to determine if a request to the KEGG REST API will fail or pass before actually executing the command.

% kegg_pull rest ddi invalid-drug-entry-id --test

False
