A seven-stage pipeline for curating ChEMBL into ready-to-use ML screening datasets: compound filtering, receptor structure validation, active clustering, global pool construction, receptor similarity, decoy selection, and train/test splitting by receptor similarity.
PyMOL-rendered binding sites for two example targets from the curated dataset. Binding residues within 4 Å are shown as sticks, ligands as ball-and-stick, and polar contacts as dashes.
Extract and filter ligand-target pairs from ChEMBL. Only high-confidence, single-receptor binding assay data is retained.
SQL-level filters on the ChEMBL activities, assays, and target_dictionary tables, plus RDKit compound validation.
| chembl_id | pchembl | smiles |
|---|---|---|
| CHEMBL3941457 | 5.5100 | O=C(Nc1ccc(F)cc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1 |
| CHEMBL3949206 | 5.6000 | Fc1ccc(NC(=S)N2CCN(c3ccnc4cc(Cl)ccc34)CC2)cc1 |
| CHEMBL144307 | 5.2600 | S=C(NCc1ccccc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1 |
| CHEMBL3968705 | 5.1200 | CC(=O)c1ccc(NC(=O)CN2CCN(c3ccnc4cc(Cl)ccc34)CC2... |
| CHEMBL3967142 | 5.0800 | Clc1ccc2c(N3CCN(Cc4c[nH]c5ccccc45)CC3)ccnc2c1 |
| CHEMBL3957660 | 5.0700 | O=C(c1cc2ccccc2[nH]1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1 |
| CHEMBL3921744 | 5.0200 | O=[N+]([O-])c1cnc(N2CCN(c3ccnc4cc(Cl)ccc34)CC2)s1 |
| CHEMBL3895213 | 5.0400 | O=C(CN1CCN(c2ccnc3cc(Cl)ccc23)CC1)Nc1ccccn1 |
| CHEMBL36506 | 7.3400 | CO[C@@H]1[C@@H](OC(N)=O)[C@@H](O)[C@H](Oc2ccc3c... |
# Stage 1: curate compounds from ChEMBL
chembl-curator curate --download --output $DATA
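The Stage 1 filters can be pictured as a single join across the three tables. The sketch below runs against a toy in-memory SQLite database, not a real ChEMBL dump. The column names follow the public ChEMBL schema, but the specific conditions (binding assays only, confidence score of at least 9, exact `=` relations) are assumptions about what "high-confidence, single-receptor" might mean, not the tool's actual query. RDKit validation is left out.

```python
import sqlite3

# Toy tables mirroring a few ChEMBL columns; rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, target_type TEXT);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER,
                     assay_type TEXT, confidence_score INTEGER);
CREATE TABLE activities (activity_id INTEGER PRIMARY KEY, assay_id INTEGER,
                         molregno INTEGER, standard_relation TEXT,
                         pchembl_value REAL);
INSERT INTO target_dictionary VALUES (1, 'SINGLE PROTEIN'), (2, 'PROTEIN COMPLEX');
INSERT INTO assays VALUES (10, 1, 'B', 9), (11, 1, 'F', 9),
                          (12, 2, 'B', 9), (13, 1, 'B', 4);
INSERT INTO activities VALUES
    (100, 10, 1001, '=', 7.34),  -- kept
    (101, 11, 1002, '=', 6.10),  -- functional assay, dropped
    (102, 12, 1003, '=', 8.00),  -- protein complex, dropped
    (103, 13, 1004, '=', 5.50),  -- low-confidence target mapping, dropped
    (104, 10, 1005, '>', 5.00),  -- censored value, dropped
    (105, 10, 1006, '=', NULL);  -- no pChEMBL value, dropped
""")

# Assumed filter set: binding assay, single protein, confident mapping,
# exact measurement with a pChEMBL value.
QUERY = """
SELECT act.molregno, act.pchembl_value
FROM activities AS act
JOIN assays AS a ON a.assay_id = act.assay_id
JOIN target_dictionary AS t ON t.tid = a.tid
WHERE a.assay_type = 'B'
  AND a.confidence_score >= :min_confidence
  AND t.target_type = 'SINGLE PROTEIN'
  AND act.standard_relation = '='
  AND act.pchembl_value IS NOT NULL
ORDER BY act.molregno
"""

rows = conn.execute(QUERY, {"min_confidence": 9}).fetchall()
print(rows)
```

Only the first toy activity survives all five conditions; each of the other rows violates exactly one of them.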
Fetch PDB structures, download AlphaFold models, align via TM-align, and keep only targets with a single, validated binding site.
First example target:
| PDB_ID | Chain | Aligned_File | Ligand | Center_X | Center_Y | Center_Z |
|---|---|---|---|---|---|---|
| 4B6C | A | 4b6c_A.pdb | B5U | 21.581 | 12.742 | -2.526 |
| 4BAE | A | 4bae_A.pdb | RWX | 21.351 | 14.055 | -1.085 |
| 4BAE | B | 4bae_B.pdb | RWX | 21.352 | 14.125 | -1.259 |
| 4BAE | C | 4bae_C.pdb | RWX | 21.401 | 14.103 | -1.084 |
| 4BAE | D | 4bae_D.pdb | RWX | 21.271 | 14.168 | -1.215 |
| 6ZT3 | A | 6zt3_A.pdb | ANP | 18.445 | 9.574 | -2.669 |
Second example target:
| PDB_ID | Chain | Aligned_File | Ligand | Center_X | Center_Y | Center_Z |
|---|---|---|---|---|---|---|
| 4POP | A | 4pop_A.pdb | 2VY | 8.514 | 6.315 | -7.388 |
| 4POV | A | 4pov_A.pdb | 2VY | 8.483 | 6.307 | -7.404 |
| 4POV | B | 4pov_B.pdb | 2VY | 8.449 | 6.347 | -7.370 |
# Stage 2: validate receptor structures and binding sites
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
Butina clustering of actives per target at Tanimoto similarity ≥ 0.7 (equivalently, distance ≤ 0.3, the value passed as --dist-thresh). The compound with the highest pChEMBL in each cluster is kept as its representative.
First example target:
| chembl_id | pchembl | smiles | cluster_size |
|---|---|---|---|
| CHEMBL3941457 | 5.5100 | O=C(Nc1ccc(F)cc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1 | 6 |
| CHEMBL3949206 | 5.6000 | Fc1ccc(NC(=S)N2CCN(c3ccnc4cc(Cl)ccc34)CC2)cc1 | 2 |
| CHEMBL144307 | 5.2600 | S=C(NCc1ccccc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1 | 1 |
| CHEMBL3968705 | 5.1200 | CC(=O)c1ccc(NC(=O)CN2CCN(c3ccnc4cc(Cl)ccc34)CC2... | 1 |
| CHEMBL3967142 | 5.0800 | Clc1ccc2c(N3CCN(Cc4c[nH]c5ccccc45)CC3)ccnc2c1 | 1 |
| CHEMBL3957660 | 5.0700 | O=C(c1cc2ccccc2[nH]1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1 | 1 |
| CHEMBL3921744 | 5.0200 | O=[N+]([O-])c1cnc(N2CCN(c3ccnc4cc(Cl)ccc34)CC2)s1 | 1 |
| CHEMBL3895213 | 5.0400 | O=C(CN1CCN(c2ccnc3cc(Cl)ccc23)CC1)Nc1ccccn1 | 1 |
| CHEMBL36506 | 7.3400 | CO[C@@H]1[C@@H](OC(N)=O)[C@@H](O)[C@H](Oc2ccc3c... | 1 |
Second example target:
| chembl_id | pchembl | smiles | cluster_size |
|---|---|---|---|
| CHEMBL4289524 | 6.2800 | Cc1ncc(Cc2cccc(CO)c2)c(N)n1 | 3 |
| CHEMBL401772 | 8.3700 | Cc1ncc(Cc2csc(CCO)c2C)c(N)n1 | 2 |
| CHEMBL4278850 | 6.5800 | Cc1ncc(Cc2cccc(CCO)c2)c(N)n1 | 1 |
| CHEMBL4289112 | 7.7000 | Cc1ncc(Cc2csc(CCOC(=O)c3c[nH]c4ccccc34)c2C)c(N)n1 | 1 |
| CHEMBL4287450 | 7.0100 | Cc1ncc(Cc2csc(CNCc3ccccn3)c2C)c(N)n1 | 1 |
| CHEMBL4285718 | 6.6800 | Cc1ncc(Cc2csc(COP(=O)(O)OP(=O)(O)O)c2C)c(N)n1 | 1 |
| CHEMBL4284640 | 6.7800 | Cc1ncc(Cc2csc(CNCC(c3ccccc3)c3ccccc3)c2C)c(N)n1 | 1 |
| CHEMBL4281235 | 7.3900 | Cc1ncc(Cc2csc(COCc3ccccc3)c2C)c(N)n1 | 1 |
| CHEMBL4279528 | 6.7400 | Cc1ncc(Cc2csc(CCS[C@@H]3O[C@H](CO)[C@@H](O)[C@H... | 1 |
| CHEMBL1547 | 9.9100 | Cc1ncc(C[n+]2csc(CCO)c2C)c(N)n1 | 1 |
| CHEMBL1236376 | 8.8000 | Cc1ncc(C[n+]2csc(CCOP(=O)(O)OP(=O)(O)O)c2C)c(N)n1 | 1 |
| CHEMBL1229798 | 9.7400 | Cc1ncc(C[n+]2cccc(CCO)c2C)c(N)n1 | 1 |
# Stage 3: cluster actives per target (Butina, Tanimoto ≥ 0.7)
chembl-curator cluster-actives --data-dir $DATA --dist-thresh 0.3 --workers 8
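To make the clustering step concrete, here is a minimal pure-Python sketch of Butina-style clustering followed by representative selection. It is not the tool's implementation. Fingerprints are modelled as plain sets of "on" bits instead of RDKit Morgan fingerprints, and the compound names and pChEMBL values are invented.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fps, dist_thresh=0.3):
    """Cluster by Tanimoto distance; returns lists of indices, centroid first."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= dist_thresh:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit candidates with the most neighbours first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned, clusters = set(), []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical actives for one target: (name, pChEMBL, fingerprint bits).
compounds = [
    ("cpd_A", 5.5, frozenset(range(1, 11))),
    ("cpd_B", 5.6, frozenset(list(range(1, 10)) + [11])),
    ("cpd_C", 7.3, frozenset(list(range(1, 9)) + [12, 13])),
    ("cpd_D", 6.0, frozenset({20, 21, 22})),
]

clusters = butina([fp for _, _, fp in compounds], dist_thresh=0.3)

# Keep the highest-pChEMBL member of each cluster and record the cluster size.
representatives = []
for members in clusters:
    best = max(members, key=lambda i: compounds[i][1])
    representatives.append((compounds[best][0], compounds[best][1], len(members)))

print(representatives)
```

Here cpd_A and cpd_B share 9 of 11 bits (similarity about 0.82) and fall into one cluster, while cpd_C reaches only about 0.67 against either and stays a singleton. The representative of the first cluster is cpd_B because it has the higher pChEMBL, even though cpd_A is the centroid.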
Build a global compound pool from all clustered actives: deduplicate by ChEMBL ID, then compute molecular descriptors and Morgan fingerprints for property matching.
# Stage 4: build global compound pool
chembl-curator build-pool --data-dir $DATA
Compute pairwise receptor similarity from sequence identity (MMseqs2) and pocket RMSD (TM-align + pocket extraction). The results are used for decoy exclusion and train/test splitting.
Sequence identity (MMseqs2):
| query | target | seqid |
|---|---|---|
| P38646 | P11021 | 0.5640 |
| P38646 | Q92598 | 0.3160 |
| Q62645 | O15399 | 0.9610 |
| Q62645 | Q14957 | 0.6250 |
| Q62645 | Q00961 | 0.6290 |
| Q62645 | Q00959 | 0.5870 |
| Q62645 | Q05586 | 0.2750 |
| Q62645 | P42264 | 0.3110 |
| Q62645 | P39086 | 0.2800 |
| Q62645 | Q13002 | 0.2790 |
| Q62645 | Q01812 | 0.3000 |
| Q62645 | P19493 | 0.3030 |
| Q62645 | P19492 | 0.3010 |
| Q9UL51 | O88703 | 0.9140 |
| Q9UL51 | Q9Y3Q4 | 0.8680 |
Pocket RMSD (TM-align + pocket extraction), values rounded to four decimals. In these rows a pocket_rmsd of -1.0 appears only where n_matched is 0 or 1, so it looks like a "no usable pocket match" marker and not a real distance.
| target_a | target_b | tm_score | pocket_rmsd | n_matched |
|---|---|---|---|---|
| A0A0H2UPP7 | A0A6L0XH39 | 0.2340 | 4.4577 | 9 |
| A0A0H2UPP7 | A0A6L8P2U9 | 0.2422 | 5.1205 | 3 |
| A0A0H2UPP7 | A0QNE0 | 0.2149 | -1.0 | 0 |
| A0A0H2UPP7 | A0R607 | 0.1930 | -1.0 | 0 |
| A0A0H2UPP7 | A2RI47 | 0.3008 | -1.0 | 0 |
| A0A0H2UPP7 | A2RP81 | 0.2707 | 5.8866 | 7 |
| A0A0H2UPP7 | A4D1P6 | 0.2576 | -1.0 | 0 |
| A0A0H2UPP7 | A5H660 | 0.2528 | -1.0 | 0 |
| A0A0H2UPP7 | A5K1A2 | 0.2255 | -1.0 | 0 |
| A0A0H2UPP7 | B2RXH2 | 0.2439 | 5.9403 | 6 |
| A0A0H2UPP7 | B4EB35 | 0.2799 | 5.9163 | 7 |
| A0A0H2UPP7 | D1MEN9 | 0.2745 | -1.0 | 1 |
| A0A0H2UPP7 | D3W065 | 0.3776 | 4.7029 | 8 |
| A0A0H2UPP7 | E5KIY2 | 0.2826 | -1.0 | 0 |
| A0A0H2UPP7 | F1M391 | 0.3601 | -1.0 | 0 |
# Stage 5: compute pairwise receptor similarity
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
Select property-matched, chemically dissimilar decoys for each active. Receptor similarity from Stage 5 is used to exclude compounds that are active against similar targets (seqid > 0.6 or pocket RMSD < 3 Å).
| active_chembl_id | decoy_chembl_ids |
|---|---|
| CHEMBL36506 | CHEMBL5771943, CHEMBL1814275, CHEMBL327404, CHEMBL6064676, CHEMBL3393486 ... +25 more |
| CHEMBL3941457 | CHEMBL5441140, CHEMBL5558804, CHEMBL3937054, CHEMBL178288, CHEMBL1617447 ... +25 more |
| CHEMBL144307 | CHEMBL2018926, CHEMBL159648, CHEMBL397473, CHEMBL341936, CHEMBL1797400 ... +25 more |
| CHEMBL3967142 | CHEMBL1254419, CHEMBL5772051, CHEMBL1760949, CHEMBL3415163, CHEMBL285895 ... +25 more |
| CHEMBL3949206 | CHEMBL84413, CHEMBL215704, CHEMBL328283, CHEMBL359834, CHEMBL467240 ... +25 more |
| CHEMBL3921744 | CHEMBL3220124, CHEMBL328993, CHEMBL1951882, CHEMBL5879396, CHEMBL4159686 ... +25 more |
| CHEMBL3895213 | CHEMBL2381363, CHEMBL410598, CHEMBL1800770, CHEMBL5748326, CHEMBL3950988 ... +25 more |
| CHEMBL3957660 | CHEMBL563247, CHEMBL458289, CHEMBL395795, CHEMBL4473652, CHEMBL5897022 ... +25 more |
| CHEMBL3968705 | CHEMBL5408526, CHEMBL3664335, CHEMBL3670601, CHEMBL208454, CHEMBL5399671 ... +25 more |
# Stage 6: select property-matched, receptor-aware decoys
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
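The receptor-aware exclusion rule (seqid > 0.6 or pocket RMSD < 3 Å) can be sketched as a small predicate plus a filter over candidate decoys. This is an illustration under assumptions, not the tool's code: the target and compound names are hypothetical, missing pairs are treated as dissimilar, and a negative pocket RMSD is assumed to mean "no pocket match", as the -1.0 rows in the Stage 5 table suggest. Property matching and chemical dissimilarity are not shown.

```python
SEQID_MAX = 0.6        # above this, receptors count as similar by sequence
POCKET_RMSD_MAX = 3.0  # below this (in Å), receptors count as similar by pocket


def receptors_similar(seqid, pocket_rmsd):
    """True if either criterion flags the pair; None means 'pair not reported'."""
    by_sequence = seqid is not None and seqid > SEQID_MAX
    # Assumption: negative RMSD is a sentinel, so it must not pass the "< 3" test.
    by_pocket = pocket_rmsd is not None and 0.0 <= pocket_rmsd < POCKET_RMSD_MAX
    return by_sequence or by_pocket


def eligible_decoys(target, candidates, actives_by_target, seqid, pocket_rmsd):
    """Drop candidates that are active on the target or on any similar receptor."""
    excluded = set(actives_by_target.get(target, ()))
    for other, actives in actives_by_target.items():
        if other == target:
            continue
        pair = frozenset((target, other))
        if receptors_similar(seqid.get(pair), pocket_rmsd.get(pair)):
            excluded |= actives
    return [c for c in candidates if c not in excluded]


# Hypothetical inputs: four targets, one active each, plus an unrelated compound.
actives_by_target = {"T1": {"a1"}, "T2": {"b1"}, "T3": {"c1"}, "T4": {"d1"}}
seqid = {frozenset(("T1", "T2")): 0.96, frozenset(("T1", "T4")): 0.28}
pocket_rmsd = {frozenset(("T1", "T3")): 2.1, frozenset(("T1", "T4")): -1.0}

decoys_for_t1 = eligible_decoys(
    "T1", ["a1", "b1", "c1", "d1", "e1"], actives_by_target, seqid, pocket_rmsd
)
print(decoys_for_t1)
```

The explicit lower bound on the RMSD matters: a naive `pocket_rmsd < 3.0` check would treat every -1.0 row as a near-identical pocket and exclude far too many compounds.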
Sequence-identity clustering (MMseqs2) ensures no similar receptors leak between splits. Greedy assignment balances per-source ratios. Sampling weight = 1 / log2(cluster_size + 1).
Train split (sample entries):
| source | entry_id | compound | weight |
|---|---|---|---|
| biolip | 10gs_VWW_A_1 | - | 0.19 |
| biolip | 11gs_EAA_A_1 | - | 0.19 |
| biolip | 16pk_BIS_A_1 | - | 0.26 |
| pdbbind | 3eml | - | 0.17 |
| chembl | A0A0H2UPP7 | CHEMBL405346 | 0.17 |
| chembl | A0A0H2UPP7 | CHEMBL407216 | 0.17 |
| chembl | A0A6L8P2U9 | CHEMBL557281 | 0.17 |
| chembl | A0QNE0 | CHEMBL36506 | 0.19 |
Test split (sample entries):
| source | entry_id | compound | weight |
|---|---|---|---|
| biolip | 1a5v_Y3_A_1 | - | 1.00 |
| biolip | 1a99_PUT_A_1 | - | 0.43 |
| pdbbind | 1b55 | - | 0.39 |
| chembl | A2RI47 | CHEMBL4289524 | 0.33 |
| chembl | A2RI47 | CHEMBL401772 | 0.33 |
| chembl | O14936 | CHEMBL291278 | 0.43 |
| chembl | O14936 | CHEMBL375163 | 0.43 |
| chembl | O14936 | CHEMBL50981 | 0.43 |
Per-target split summary:
| uniprot | split | n_actives | n_decoys |
|---|---|---|---|
| A0A0H2UPP7 | train | 2 | 60 |
| A0A6L0XH39 | train | 1 | 30 |
| A0A6L8P2U9 | train | 14 | 420 |
| A0QNE0 | train | 9 | 270 |
| A0R607 | train | 1 | 30 |
| A2RI47 | test | 12 | 360 |
| A2RP81 | test | 6 | 162 |
| A4D1P6 | train | 3 | 90 |
| A5H660 | train | 65 | 1950 |
| A5K1A2 | train | 3 | 90 |
| B2RXH2 | train | 50 | 1233 |
| B4EB35 | test | 1 | 30 |
| D1MEN9 | train | 1 | 30 |
| D3W065 | train | 1 | 30 |
| E5KIY2 | train | 4 | 90 |
| F1M391 | train | 1 | 30 |
| F1QCV2 | train | 11 | 330 |
| G3FIN0 | train | 1 | 30 |
| I6L8L7 | train | 1 | 30 |
| I6WXK4 | test | 24 | 720 |
# Stage 7: train/test split by sequence identity clustering
chembl-curator split --data-dir $DATA --valid-frac 0.1
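The sampling weight formula and the idea of assigning whole sequence clusters to one split can be sketched as follows. The weight function is taken directly from the formula above. The greedy rule is only one plausible way to balance per-source ratios and is an assumption, as are the cluster names, source counts, and the 0.25 held-out fraction used in the toy data.

```python
import math
from collections import Counter


def sampling_weight(cluster_size: int) -> float:
    """Weight = 1 / log2(cluster_size + 1); a singleton cluster gets 1.0."""
    return 1.0 / math.log2(cluster_size + 1)


def greedy_split(clusters, test_frac):
    """Assign whole clusters to 'test' while no source exceeds its quota.

    clusters: dict of cluster id -> {source name: number of entries}.
    A cluster never straddles both splits, so similar receptors cannot leak.
    """
    totals = Counter()
    for counts in clusters.values():
        totals.update(counts)
    test_counts = Counter()
    assignment = {}
    # Largest clusters first, so small ones can fill the remaining quota.
    ordered = sorted(clusters.items(), key=lambda kv: (-sum(kv[1].values()), kv[0]))
    for cid, counts in ordered:
        fits = all(
            test_counts[src] + n <= test_frac * totals[src]
            for src, n in counts.items()
        )
        if fits:
            test_counts.update(counts)
            assignment[cid] = "test"
        else:
            assignment[cid] = "train"
    return assignment


# Hypothetical sequence clusters with entry counts per data source.
clusters = {
    "c1": {"chembl": 50, "biolip": 10},
    "c2": {"chembl": 30, "pdbbind": 10},
    "c3": {"chembl": 10, "biolip": 2},
    "c4": {"biolip": 8, "pdbbind": 10},
    "c5": {"chembl": 10},
}

assignment = greedy_split(clusters, test_frac=0.25)
weights = {size: round(sampling_weight(size), 2) for size in (1, 4, 7)}
print(assignment)
print(weights)
```

With the formula as stated, cluster sizes of 1, 4, and 7 give weights of 1.00, 0.43, and 0.33, which are values that also appear in the sample tables above. Down-weighting by the logarithm keeps large clusters from dominating sampling without discarding their members.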
Run the full pipeline end-to-end.
# Full pipeline
DATA=curated_data_filtered
chembl-curator curate --download --output $DATA
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
chembl-curator cluster-actives --data-dir $DATA --workers 8
chembl-curator build-pool --data-dir $DATA
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
chembl-curator split --data-dir $DATA --valid-frac 0.1