Tutorial
Note: Many KEGG entry IDs contain colons and kegg_pull
saves
KEGG entry files with their ID in the file name. When running on
Windows, all file names with colons will have their colons replaced with
underscores.
API
SinglePull
You can pull (request and save to the file system or save in memory) a
limited number of KEGG entries using the SinglePull
class with the
list of entry IDs as input. The pull
method stores the entries in
the file system and the pull_dict
method stores them in memory,
specifically in a dictionary. The number of entries is limited because
only one request is made to the KEGG REST API and KEGG places a limit on
the number of entries that can be pulled with a single request. The
output
parameter in the pull
method specifies where the entries
are saved based on whether output
is a directory or ZIP archive if
it ends in “.zip”. The PullResult
object that’s returned tells you
which of the entry IDs succeeded, which failed, and which timed out. In
this example, the entry ID succeeds.
pull
import kegg_pull.pull as p
single_pull = p.SinglePull()
entry_ids = ['br:br08902']
pull_result = single_pull.pull(entry_ids=entry_ids, output='pull-entries/')
print(pull_result)
Successful Entry Ids: br:br08902
Failed Entry Ids: none
Timed Out Entry Ids: none
In this example, the entry ID fails.
single_pull.pull(['br:br03220'], output='pull-entries/')
Successful Entry Ids: none
Failed Entry Ids: br:br03220
Timed Out Entry Ids: none
When SinglePull
pulls multiple entries at a time, they are
automatically separated from one another and saved in individual files.
single_pull.pull(entry_ids=['cpd:C00001', 'cpd:C00002', 'cpd:C00003'], output='pull-entries/')
Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003
Failed Entry Ids: none
Timed Out Entry Ids: none
An exception is thrown if you attempt to provide more entry IDs to the
pull()
method than what is accepted by KEGG.
import logging as log
try:
single_pull.pull(entry_ids=['cpd:C00001', 'cpd:C00002', 'cpd:C00003', 'cpd:C00004', 'cpd:C00005', 'cpd:C00006',
'cpd:C00007', 'cpd:C00008', 'cpd:C00009', 'cpd:C00010', 'cpd:C00011'], output='pull-entries/')
except ValueError as error:
log.error(error)
ERROR:root:Cannot create URL - The maximum number of entry IDs is 10 but 11 were provided
pull_dict
The pull_dict
method pulls entries just like the pull
method
except that it stores the entries in memory rather than in the file
system. This is helpful if the user of the API wishes to use the entries
immediately and doesn’t need them to be stored persistently.
Specifically, pull_dict
returns a tuple containing the
PullResult
as described above along with a dictionary mapping the
entry ID to the entry itself.
pull_result, compounds = single_pull.pull_dict(
entry_ids=['cpd:C00001', 'cpd:C00002', 'cpd:C00003'], entry_field='mol')
pull_result
Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003
Failed Entry Ids: none
Timed Out Entry Ids: none
The entries can be accessed from the dictionary using the provided entry IDs as keys.
print(compounds['cpd:C00001'])
3 2 0 0 0 0 0 0 0 0999 V2000
22.1250 -16.2017 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
23.6000 -15.2112 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
20.7129 -15.2859 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
M END
> <ENTRY>
cpd:C00001
SingleProcessMultiplePull
and MultiProcessMultiplePull
To get past the limit on the number of entries that can be pulled at a
time, we have two classes capable of pulling an arbitrary number of
entries. There’s the SingleProcessMultiplePull
and
MultiProcessMultiplePull
. MultiProcessMultiplePull
will likely
pull faster since it pulls within multiple processes, but it requires
multiple cores. Like SinglePull
, these two classes have both a
pull
method and pull_dict
method which respectively return a
PullResult
and a tuple containing a pull result and dictionary.
multiple_pull = p.SingleProcessMultiplePull()
entry_ids = [
'cpd:C00001',
'cpd:C00002',
'cpd:C00003',
'cpd:C00004',
'cpd:C00005',
'cpd:C00006',
'cpd:C00007',
'cpd:C00008',
'cpd:C00009',
'cpd:C00010',
'cpd:C00011'
]
multiple_pull.pull(entry_ids, output='pull-entries/')
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:02<00:00, 3.72it/s]
Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003, cpd:C00004, cpd:C00005, cpd:C00006, cpd:C00007, cpd:C00008, cpd:C00009, cpd:C00010, cpd:C00011
Failed Entry Ids: none
Timed Out Entry Ids: none
You can specify the number of processes to use for
MultiProcessMultiplePull
with the n_workers
parameter, which
defaults to the number of cores available.
multiple_pull = p.MultiProcessMultiplePull(n_workers=2)
multiple_pull.pull(entry_ids, output='pull-entries/')
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00, 6.26it/s]
Successful Entry Ids: cpd:C00001, cpd:C00002, cpd:C00003, cpd:C00004, cpd:C00005, cpd:C00006, cpd:C00007, cpd:C00008, cpd:C00009, cpd:C00010, cpd:C00011
Failed Entry Ids: none
Timed Out Entry Ids: none
The pull_dict
method is also available:
pull_result, compounds = multiple_pull.pull_dict(entry_ids, entry_field='mol')
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00, 10.72it/s]
print(compounds['cpd:C00011'])
3 2 0 0 0 0 0 0 0 0999 V2000
21.8400 -11.9918 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
20.6288 -12.6940 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
23.0512 -12.6940 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0
1 3 2 0 0 0
M END
> <ENTRY>
cpd:C00011
$$$$
Entry IDs
The entry_ids
module provides a number of different ways to pull a
list of KEGG entry IDs.
import kegg_pull.entry_ids as ei
entry_ids = ei.from_database('brite')
print(entry_ids)
['br:br08901', 'br:br08902', 'br:br08904', 'br:br08906', 'br:ko00001', 'br:ko00002', 'br:ko00003', 'br:br08907', 'br:ko01000', 'br:ko01001', 'br:ko01009', 'br:ko01002', 'br:ko01003', 'br:ko01005', 'br:ko01011', 'br:ko01004', 'br:ko01008', 'br:ko01006', 'br:ko01007', 'br:ko00199', 'br:ko00194', 'br:ko03000', 'br:ko03021', 'br:ko03019', 'br:ko03041', 'br:ko03011', 'br:ko03009', 'br:ko03016', 'br:ko03012', 'br:ko03110', 'br:ko04131', 'br:ko04121', 'br:ko03051', 'br:ko03032', 'br:ko03036', 'br:ko03400', 'br:ko03029', 'br:ko02000', 'br:ko02044', 'br:ko02042', 'br:ko02022', 'br:ko02035', 'br:ko03037', 'br:ko04812', 'br:ko04147', 'br:ko02048', 'br:ko04030', 'br:ko04050', 'br:ko04054', 'br:ko03310', 'br:ko04040', 'br:ko04031', 'br:ko04052', 'br:ko04515', 'br:ko04090', 'br:ko01504', 'br:ko00535', 'br:ko00536', 'br:ko00537', 'br:ko04091', 'br:ko04990', 'br:ko03200', 'br:ko03210', 'br:ko03100', 'br:br08001', 'br:br08002', 'br:br08003', 'br:br08005', 'br:br08006', 'br:br08007', 'br:br08009', 'br:br08021', 'br:br08120', 'br:br08201', 'br:br08202', 'br:br08204', 'br:br08203', 'br:br08303', 'br:br08302', 'br:br08301', 'br:br08313', 'br:br08312', 'br:br08304', 'br:br08305', 'br:br08331', 'br:br08330', 'br:br08332', 'br:br08310', 'br:br08307', 'br:br08327', 'br:br08311', 'br:br08402', 'br:br08401', 'br:br08403', 'br:br08411', 'br:br08410', 'br:br08420', 'br:br08601', 'br:br08610', 'br:br08611', 'br:br08612', 'br:br08613', 'br:br08614', 'br:br08615', 'br:br08620', 'br:br08621', 'br:br08605', 'br:br03220', 'br:br03222', 'br:br03223', 'br:br01610', 'br:br01611', 'br:br01612', 'br:br01613', 'br:br01601', 'br:br01602', 'br:br01600', 'br:br01620', 'br:br01553', 'br:br01554', 'br:br01556', 'br:br01555', 'br:br01557', 'br:br01800', 'br:br01810', 'br:br08011', 'br:br08020', 'br:br08012', 'br:br08319', 'br:br08329', 'br:br08318', 'br:br08328', 'br:br08309', 'br:br08341', 'br:br08324', 'br:br08317', 'br:br08315', 'br:br08314', 'br:br08442', 'br:br08441', 'br:br08431']
Entry ID Mappings
The map
module converts the output of the KEGG REST API “link”
operation or “conv” operation into dictionaries usable in python code.
“link” operation
import kegg_pull.map as kmap
pathway_to_compound = kmap.entries_link(entry_ids=['path:map00010', 'path:map00020'], target_database='compound')
print(pathway_to_compound)
{'path:map00010': {'cpd:C00186', 'cpd:C00033', 'cpd:C00036', 'cpd:C00668', 'cpd:C06188', 'cpd:C01172', 'cpd:C05378', 'cpd:C00118', 'cpd:C06186', 'cpd:C00084', 'cpd:C05125', 'cpd:C00074', 'cpd:C00267', 'cpd:C00024', 'cpd:C00631', 'cpd:C00469', 'cpd:C00236', 'cpd:C00103', 'cpd:C00197', 'cpd:C00221', 'cpd:C00068', 'cpd:C15972', 'cpd:C00031', 'cpd:C00111', 'cpd:C01159', 'cpd:C16255', 'cpd:C15973', 'cpd:C05345', 'cpd:C00022', 'cpd:C06187', 'cpd:C01451'}, 'path:map00020': {'cpd:C00036', 'cpd:C00091', 'cpd:C00042', 'cpd:C00122', 'cpd:C05125', 'cpd:C00074', 'cpd:C00024', 'cpd:C00026', 'cpd:C16254', 'cpd:C00068', 'cpd:C15972', 'cpd:C00158', 'cpd:C00149', 'cpd:C00417', 'cpd:C16255', 'cpd:C15973', 'cpd:C05381', 'cpd:C00311', 'cpd:C05379', 'cpd:C00022'}}
“conv” operation
kegg_to_pubchem = kmap.entries_conv(entry_ids=['cpd:C00001', 'cpd:C00002'], target_database='pubchem')
print(kegg_to_pubchem)
{'cpd:C00001': {'pubchem:3303'}, 'cpd:C00002': {'pubchem:3304'}}
Pathway Organizer
The pathway_organizer
module flattens a brite hierarchy into a
mapping of the IDs of its nodes to information about those nodes.
import kegg_pull.pathway_organizer as po
pathway_org = po.PathwayOrganizer.load_from_kegg()
print(pathway_org.hierarchy_nodes['Metabolism'])
{'name': 'Metabolism', 'level': 1, 'parent': None, 'children': ['Amino acid metabolism', 'Biosynthesis of other secondary metabolites', 'Carbohydrate metabolism', 'Chemical structure transformation maps', 'Energy metabolism', 'Global and overview maps', 'Glycan biosynthesis and metabolism', 'Lipid metabolism', 'Metabolism of cofactors and vitamins', 'Metabolism of other amino acids', 'Metabolism of terpenoids and polyketides', 'Nucleotide metabolism', 'Xenobiotics biodegradation and metabolism'], 'entry_id': None}
Rest API
The KEGGrest
class provides low-level wrapper methods for the KEGG
REST API, including all of its operations. The resulting
KEGGresponse
object contains both the text and binary versions of
the response body, the status of the response (one of SUCCESS
,
FAILED
, or TIMEOUT
), and the internal URL used to request from
the KEGG REST API. For most use-cases, the higher-level kegg_pull
functionality will be preferred over using the lower-level KEGGrest
class. In fact, most of kegg_pull
’s higher-level functionality
relies on the KEGGrest
class, but provies a more useful programming
interface.
import kegg_pull.rest as r
kegg_rest = r.KEGGrest()
kegg_response = kegg_rest.info(database='module')
kegg_response.status
<Status.SUCCESS: 1>
kegg_response.text_body
'module KEGG Module Databasenmd Release 106.0+/04-07, Apr 23n Kanehisa Laboratoriesn 550 entriesnnlinked db pathwayn kon <org>n genomen compoundn glycann reactionn enzymen diseasen pubmedn'
kegg_response.kegg_url
https://rest.kegg.jp/info/module
CLI
The command line interface has 4 subcommands: pull
, entry-ids
,
map
, and rest
. They are analogous to the API modules and
methods.
pull
From a user-specified list of entry IDs
% kegg_pull pull entry-ids cpd:C00001,cpd:C00002,cpd:C00003 --output=compound-entries/
100%|█████████████████████████████████████████████| 3/3 [00:01<00:00, 1.99it/s]
% head compound-entries/cpd:C00001.txt
ENTRY C00001 Compound
NAME H2O;
Water
FORMULA H2O
EXACT_MASS 18.0106
MOL_WEIGHT 18.0153
REMARK Same as: D00001
REACTION R00001 R00002 R00004 R00005 R00009 R00010 R00011 R00017
R00022 R00024 R00025 R00026 R00028 R00036 R00041 R00044
R00045 R00047 R00048 R00052 R00053 R00054 R00055 R00056
The pull
subcommand creates a pull-results.json
file. You can
load it as a dictionary using the python json library.
import json as j
with open('pull-results.json', 'r') as file:
pull_results = j.load(file)
print(pull_results)
{'percent-success': 100.0, 'pull-minutes': 0.03, 'num-successful': 3, 'num-failed': 0, 'num-timed-out': 0, 'num-total': 3, 'successful-entry-ids': ['cpd:C00001', 'cpd:C00002', 'cpd:C00003'], 'failed-entry-ids': [], 'timed-out-entry-ids': []}
Below is what the pull-results.json
file contents look like:
% cat pull-results.json
{
"percent-success": 100.0,
"pull-minutes": 0.03,
"num-successful": 3,
"num-failed": 0,
"num-timed-out": 0,
"num-total": 3,
"successful-entry-ids": [
"cpd:C00001",
"cpd:C00002",
"cpd:C00003"
],
"failed-entry-ids": [],
"timed-out-entry-ids": []
}
Entry IDs can also be passed in from standard input when the
<entry-ids>
option is equal to -
rather than a comma-separated
list. This example saves the entries to a ZIP archive.
standard_input = """
cpd:C00001
cpd:C00002
cpd:C00003
"""
with open('standard_input.txt', 'w') as file:
file.write(standard_input)
% cat standard_input.txt | kegg_pull pull entry-ids - --output=compound-entries.zip
100%|█████████████████████████████████████████████| 3/3 [00:01<00:00, 2.04it/s]
From a database
% kegg_pull pull database brite --multi-process --n-workers=11 --output=brite-entries/
100%|█████████████████████████████████████████| 141/141 [00:19<00:00, 7.33it/s]
% ls brite-entries/
br:br08001.txt br:br08314.txt br:br08610.txt br:ko01003.txt br:ko03036.txt
br:br08002.txt br:br08315.txt br:br08611.txt br:ko01004.txt br:ko03037.txt
br:br08003.txt br:br08317.txt br:br08612.txt br:ko01005.txt br:ko03041.txt
br:br08005.txt br:br08318.txt br:br08613.txt br:ko01006.txt br:ko03051.txt
br:br08006.txt br:br08319.txt br:br08614.txt br:ko01007.txt br:ko03100.txt
br:br08007.txt br:br08324.txt br:br08615.txt br:ko01008.txt br:ko03110.txt
br:br08009.txt br:br08327.txt br:br08620.txt br:ko01009.txt br:ko03200.txt
br:br08021.txt br:br08328.txt br:br08621.txt br:ko01011.txt br:ko03210.txt
br:br08120.txt br:br08329.txt br:br08901.txt br:ko01504.txt br:ko03310.txt
br:br08201.txt br:br08330.txt br:br08902.txt br:ko02000.txt br:ko03400.txt
br:br08202.txt br:br08331.txt br:br08904.txt br:ko02022.txt br:ko04030.txt
br:br08203.txt br:br08332.txt br:br08906.txt br:ko02035.txt br:ko04031.txt
br:br08204.txt br:br08341.txt br:br08907.txt br:ko02042.txt br:ko04040.txt
br:br08301.txt br:br08401.txt br:ko00001.txt br:ko02044.txt br:ko04050.txt
br:br08302.txt br:br08402.txt br:ko00002.txt br:ko02048.txt br:ko04052.txt
br:br08303.txt br:br08403.txt br:ko00003.txt br:ko03000.txt br:ko04054.txt
br:br08304.txt br:br08410.txt br:ko00194.txt br:ko03009.txt br:ko04090.txt
br:br08305.txt br:br08411.txt br:ko00199.txt br:ko03011.txt br:ko04091.txt
br:br08307.txt br:br08420.txt br:ko00535.txt br:ko03012.txt br:ko04121.txt
br:br08309.txt br:br08431.txt br:ko00536.txt br:ko03016.txt br:ko04131.txt
br:br08310.txt br:br08441.txt br:ko00537.txt br:ko03019.txt br:ko04147.txt
br:br08311.txt br:br08442.txt br:ko01000.txt br:ko03021.txt br:ko04515.txt
br:br08312.txt br:br08601.txt br:ko01001.txt br:ko03029.txt br:ko04812.txt
br:br08313.txt br:br08605.txt br:ko01002.txt br:ko03032.txt br:ko04990.txt
% head pull-results.json
{
"percent-success": 85.11,
"pull-minutes": 0.32,
"num-successful": 120,
"num-failed": 21,
"num-timed-out": 0,
"num-total": 141,
"successful-entry-ids": [
"br:br08901",
"br:br08902",
Printing Entries
Alternatively, you can print the KEGG entries to the screen rather than saving them in separate files.
% kegg_pull pull entry-ids C00001,C00007 --entry-field=mol --print
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 2.56it/s]
C00001
3 2 0 0 0 0 0 0 0 0999 V2000
22.1250 -16.2017 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
23.6000 -15.2112 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
20.7129 -15.2859 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
M END
> <ENTRY>
cpd:C00001
C00007
2 1 0 0 0 0 0 0 0 0999 V2000
24.3446 -17.0048 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
25.7446 -17.0048 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0
M END
> <ENTRY>
cpd:C00007
entry-ids
% kegg_pull entry-ids database brite --output=brite-entry-ids.txt
% head brite-entry-ids.txt
br:br08901
br:br08902
br:br08904
br:br08906
br:ko00001
br:ko00002
br:ko00003
br:br08907
br:ko01000
br:ko01001
% kegg_pull entry-ids molec-attr drug --em=433 --em=434
dr:D00752
dr:D00892
dr:D02110
dr:D02114
dr:D02238
dr:D03088
dr:D04789
dr:D05806
dr:D05911
dr:D06342
dr:D07084
dr:D07761
dr:D07879
dr:D08757
dr:D09567
dr:D10084
dr:D10309
dr:D10661
dr:D11316
% kegg_pull entry-ids keywords compound protein,enzyme
cpd:C05197
cpd:C05312
cpd:C15972
cpd:C15973
map
% kegg_pull map link entry-ids path:map00010,path:map00020 compound --output=mapping.json
% head mapping.json
{
"path:map00010": [
"cpd:C00022",
"cpd:C00024",
"cpd:C00031",
"cpd:C00033",
"cpd:C00036",
"cpd:C00068",
"cpd:C00074",
"cpd:C00084",
pathway-organizer
% kegg_pull pathway-organizer --tln=Metabolism --fn="Global and overview maps,Carbohydrate metabolism" --output=hierarchy-nodes.json
% head hierarchy-nodes.json
{
"path:map00190": {
"name": "00190 Oxidative phosphorylation",
"level": 3,
"parent": "Energy metabolism",
"children": null,
"entry_id": "path:map00190"
},
"path:map00195": {
"name": "00195 Photosynthesis",
rest
% kegg_pull rest info enzyme
enzyme KEGG Enzyme Database
ec Release 106.0+/04-07, Apr 23
Kanehisa Laboratories
8,056 entries
linked db pathway
module
ko
<org>
vg
vp
ag
compound
glycan
reaction
rclass
% kegg_pull rest get cpd:C00007 --entry-field=mol
2 1 0 0 0 0 0 0 0 0999 V2000
24.3446 -17.0048 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
25.7446 -17.0048 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0
M END
> <ENTRY>
cpd:C00007
$$$$
% kegg_pull rest conv entry-ids gl:G13143,gl:G13141,gl:G13139 pubchem
gl:G13143 pubchem:405226698
gl:G13141 pubchem:405226697
gl:G13139 pubchem:405226696
The rest
subcommand additionally offers the --test
option to
determine if a request to the KEGG REST API will fail or pass before
actually executing the command.
% kegg_pull rest ddi invalid-drug-entry-id --test
False