ChEMBL-Q

A seven-stage pipeline for curating ChEMBL into ready-to-use ML screening datasets: compound filtering, receptor structure validation, active clustering, decoy selection, and train/test splitting by receptor similarity.

Targets: 1,543
Actives: 172,092
Decoys: 5,064,675
Train / Test: 1,388 / 155

1 Filter Actives
2 Filter Receptors
3 Cluster Actives
4 Decoy Pool
5 Receptor Sim
6 Select Decoys
7 Split

Example Targets: Receptor-Ligand Structures

PyMOL-rendered binding sites from two example targets in the curated dataset. Binding residues within 4 Å of the ligand are shown as sticks, ligands as ball-and-stick, and polar contacts as dashes.

A0QNE0 — DNA Gyrase B

PDB 6ZT3 · 1.56 Å
[Figure: A0QNE0 overview. Full structure with ANP ligand.]
[Figure: A0QNE0 binding site. ANP binding site detail.]

A2RI47 — Thiamine Transporter ThiT

PDB 4POP · 2.20 Å
[Figure: A2RI47 overview. Full structure with 2VY ligand.]
[Figure: A2RI47 binding site. 2VY binding site detail.]

1 Filter Active Compounds

Extract and filter ligand-target pairs from ChEMBL. Only high-confidence, single-receptor binding assay data is retained.

Default Filters

SQL-level filters on ChEMBL activities, assays, and target_dictionary tables, plus RDKit compound validation.

target_type: SINGLE PROTEIN
standard_type: Ki, Kd, IC50, EC50
standard_relation: =, ≤
standard_units: nM, μM
standard_value: <10,000 nM / <10 μM
pchembl_value: ≥ 5.0 (configurable)
confidence_score: ≥ 6 (configurable)
assay_type: B (Binding only)
bao_format: configurable (e.g. BAO_0000357)
data_validity_comment: must be NULL
potential_duplicate: excluded
standard_flag: optional (curated data)
Valid SMILES: RDKit parseable
Heavy atoms: 5–80
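
The record-level filters above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `Activity` class and `passes_default_filters` are hypothetical names, the fields simply mirror the ChEMBL columns listed above, and the example row is invented. The RDKit checks (parseable SMILES, 5–80 heavy atoms) and the optional bao_format and standard_flag filters are left out.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = {"Ki", "Kd", "IC50", "EC50"}
ALLOWED_RELATIONS = {"=", "<=", "≤"}
NM_PER_UNIT = {"nM": 1.0, "uM": 1000.0, "μM": 1000.0}  # conversion to nM


@dataclass
class Activity:
    """Hypothetical record holding one ChEMBL activity row."""
    target_type: str
    standard_type: str
    standard_relation: str
    standard_units: str
    standard_value: float
    pchembl_value: Optional[float]
    confidence_score: int
    assay_type: str
    data_validity_comment: Optional[str] = None
    potential_duplicate: bool = False


def passes_default_filters(a: Activity, min_pchembl: float = 5.0,
                           min_confidence: int = 6) -> bool:
    """Return True if the row survives the default filters listed above."""
    factor = NM_PER_UNIT.get(a.standard_units)
    if factor is None or a.pchembl_value is None:
        return False
    return (
        a.target_type == "SINGLE PROTEIN"
        and a.standard_type in ALLOWED_TYPES
        and a.standard_relation in ALLOWED_RELATIONS
        and a.standard_value * factor < 10_000  # < 10,000 nM
        and a.pchembl_value >= min_pchembl
        and a.confidence_score >= min_confidence
        and a.assay_type == "B"
        and a.data_validity_comment is None
        and not a.potential_duplicate
    )


# Invented example row: a binding IC50 of 3090 nM (pChEMBL 5.51).
row = Activity("SINGLE PROTEIN", "IC50", "=", "nM", 3090.0, 5.51, 9, "B")
print(passes_default_filters(row))
```

Keeping the thresholds as keyword arguments mirrors the "(configurable)" notes on pchembl_value and confidence_score.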

Example: A0QNE0 Actives (actives.tsv)

chembl_id      pchembl  smiles
CHEMBL3941457  5.5100   O=C(Nc1ccc(F)cc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1
CHEMBL3949206  5.6000   Fc1ccc(NC(=S)N2CCN(c3ccnc4cc(Cl)ccc34)CC2)cc1
CHEMBL144307   5.2600   S=C(NCc1ccccc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1
CHEMBL3968705  5.1200   CC(=O)c1ccc(NC(=O)CN2CCN(c3ccnc4cc(Cl)ccc34)CC2...
CHEMBL3967142  5.0800   Clc1ccc2c(N3CCN(Cc4c[nH]c5ccccc45)CC3)ccnc2c1
CHEMBL3957660  5.0700   O=C(c1cc2ccccc2[nH]1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1
CHEMBL3921744  5.0200   O=[N+]([O-])c1cnc(N2CCN(c3ccnc4cc(Cl)ccc34)CC2)s1
CHEMBL3895213  5.0400   O=C(CN1CCN(c2ccnc3cc(Cl)ccc23)CC1)Nc1ccccn1
CHEMBL36506    7.3400   CO[C@@H]1[C@@H](OC(N)=O)[C@@H](O)[C@H](Oc2ccc3c...

# Stage 1: curate compounds from ChEMBL
chembl-curator curate --download --output $DATA
2 Filter Receptors

Fetch PDB structures, download AlphaFold models, align via TM-align, and keep only targets with a single, validated binding site.

Steps

  1. Query UniProt for PDB entries
  2. Download PDB structures + AlphaFold model
  3. Detect ligand-bound structures (contact < 4Å)
  4. Align to AlphaFold via TM-align
  5. Cluster pocket centroids (< 10Å)
  6. Keep single-binding-site targets only

Pocket Info: A0QNE0

PDB_ID  Chain  Aligned_File  Ligand  Center_X  Center_Y  Center_Z
4B6C    A      4b6c_A.pdb    B5U     21.581    12.742    -2.526
4BAE    A      4bae_A.pdb    RWX     21.351    14.055    -1.085
4BAE    B      4bae_B.pdb    RWX     21.352    14.125    -1.259
4BAE    C      4bae_C.pdb    RWX     21.401    14.103    -1.084
4BAE    D      4bae_D.pdb    RWX     21.271    14.168    -1.215
6ZT3    A      6zt3_A.pdb    ANP     18.445    9.574     -2.669

Pocket Info: A2RI47

PDB_ID  Chain  Aligned_File  Ligand  Center_X  Center_Y  Center_Z
4POP    A      4pop_A.pdb    2VY     8.514     6.315     -7.388
4POV    A      4pov_A.pdb    2VY     8.483     6.307     -7.404
4POV    B      4pov_B.pdb    2VY     8.449     6.347     -7.370

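
Steps 5 and 6 can be illustrated with the A0QNE0 centroids from the pocket table above. The sketch below is an assumption about how the grouping might work, not the tool's implementation: it uses single-linkage grouping (two centroids belong to the same site if they lie within the cutoff), and `cluster_centroids` and `has_single_site` are hypothetical names.

```python
import math


def cluster_centroids(centroids, cutoff=10.0):
    """Group 3D points so that any two points closer than `cutoff` Å share a cluster."""
    clusters = []
    for point in centroids:
        touching, rest = [], []
        for cluster in clusters:
            near = any(math.dist(point, q) < cutoff for q in cluster)
            (touching if near else rest).append(cluster)
        # Merge the new point with every cluster it touches.
        merged = [point] + [q for cluster in touching for q in cluster]
        clusters = rest + [merged]
    return clusters


def has_single_site(centroids, cutoff=10.0):
    """A target is kept only if all ligand centroids fall into one cluster."""
    return len(cluster_centroids(centroids, cutoff)) == 1


# Ligand centroids for A0QNE0, copied from the pocket table.
a0qne0 = [
    (21.581, 12.742, -2.526),
    (21.351, 14.055, -1.085),
    (21.352, 14.125, -1.259),
    (21.401, 14.103, -1.084),
    (21.271, 14.168, -1.215),
    (18.445, 9.574, -2.669),
]
print(has_single_site(a0qne0))
```

With these six centroids every pairwise distance is below 10 Å, so the target counts as having a single binding site.
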
# Stage 2: validate receptor structures and binding sites
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
3 Cluster Actives

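Butina clustering of actives per target at Tanimoto similarity ≥ 0.7, equivalent to a distance threshold of 0.3 (the --dist-thresh value in the command below). The highest-pChEMBL compound in each cluster is picked as its representative.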
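
To make the clustering step concrete, here is a small pure-Python sketch of Butina-style clustering followed by representative selection. It is illustrative only: fingerprints are modelled as sets of on-bit indices rather than RDKit fingerprints, the toy data is invented, and the function names `butina` and `pick_representatives` are hypothetical rather than part of chembl-curator.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fps, dist_thresh=0.3):
    """Basic Butina clustering: returns clusters as lists of indices, centroid first."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= dist_thresh:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Visit molecules with the most neighbors first (ties keep input order).
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def pick_representatives(clusters, pchembl):
    """Pick the index with the highest pChEMBL value in each cluster."""
    return [max(cluster, key=lambda i: pchembl[i]) for cluster in clusters]


# Invented toy data: three near-identical fingerprints and two singletons.
fps = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 11},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12},
    {20, 21, 22},
    {30, 31},
]
pchembl = [5.5, 6.1, 5.9, 7.0, 5.2]

clusters = butina(fps)
print(clusters)
print(pick_representatives(clusters, pchembl))
```

In the toy data the first three fingerprints fall into one cluster and molecule 1 represents it because it has the highest pChEMBL value; the two singletons represent themselves.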

A0QNE0: 15 actives → 9 clusters

chembl_id      pchembl  smiles                                              cluster_size
CHEMBL3941457  5.5100   O=C(Nc1ccc(F)cc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1       6
CHEMBL3949206  5.6000   Fc1ccc(NC(=S)N2CCN(c3ccnc4cc(Cl)ccc34)CC2)cc1       2
CHEMBL144307   5.2600   S=C(NCc1ccccc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1         1
CHEMBL3968705  5.1200   CC(=O)c1ccc(NC(=O)CN2CCN(c3ccnc4cc(Cl)ccc34)CC2...  1
CHEMBL3967142  5.0800   Clc1ccc2c(N3CCN(Cc4c[nH]c5ccccc45)CC3)ccnc2c1       1
CHEMBL3957660  5.0700   O=C(c1cc2ccccc2[nH]1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1   1
CHEMBL3921744  5.0200   O=[N+]([O-])c1cnc(N2CCN(c3ccnc4cc(Cl)ccc34)CC2)s1   1
CHEMBL3895213  5.0400   O=C(CN1CCN(c2ccnc3cc(Cl)ccc23)CC1)Nc1ccccn1         1
CHEMBL36506    7.3400   CO[C@@H]1[C@@H](OC(N)=O)[C@@H](O)[C@H](Oc2ccc3c...  1

A2RI47: 15 actives → 12 clusters

chembl_id      pchembl  smiles                                              cluster_size
CHEMBL4289524  6.2800   Cc1ncc(Cc2cccc(CO)c2)c(N)n1                         3
CHEMBL401772   8.3700   Cc1ncc(Cc2csc(CCO)c2C)c(N)n1                        2
CHEMBL4278850  6.5800   Cc1ncc(Cc2cccc(CCO)c2)c(N)n1                        1
CHEMBL4289112  7.7000   Cc1ncc(Cc2csc(CCOC(=O)c3c[nH]c4ccccc34)c2C)c(N)n1   1
CHEMBL4287450  7.0100   Cc1ncc(Cc2csc(CNCc3ccccn3)c2C)c(N)n1                1
CHEMBL4285718  6.6800   Cc1ncc(Cc2csc(COP(=O)(O)OP(=O)(O)O)c2C)c(N)n1       1
CHEMBL4284640  6.7800   Cc1ncc(Cc2csc(CNCC(c3ccccc3)c3ccccc3)c2C)c(N)n1     1
CHEMBL4281235  7.3900   Cc1ncc(Cc2csc(COCc3ccccc3)c2C)c(N)n1                1
CHEMBL4279528  6.7400   Cc1ncc(Cc2csc(CCS[C@@H]3O[C@H](CO)[C@@H](O)[C@H...  1
CHEMBL1547     9.9100   Cc1ncc(C[n+]2csc(CCO)c2C)c(N)n1                     1
CHEMBL1236376  8.8000   Cc1ncc(C[n+]2csc(CCOP(=O)(O)OP(=O)(O)O)c2C)c(N)n1   1
CHEMBL1229798  9.7400   Cc1ncc(C[n+]2cccc(CCO)c2C)c(N)n1                    1

# Stage 3: cluster actives per target (Butina, Tanimoto ≥ 0.7)
chembl-curator cluster-actives --data-dir $DATA --dist-thresh 0.3 --workers 8
4 Build Decoy Pool

Build a global pool from all clustered actives. This stage deduplicates compounds by ChEMBL ID and computes molecular descriptors and Morgan fingerprints for property matching.

Pool Properties Per Compound

MW — Molecular weight
cLogP — Lipophilicity
TPSA — Topological polar surface area
HBD / HBA — H-bond donors/acceptors
AromRings — Aromatic ring count
Morgan FP — 2048-bit, radius 2
# Stage 4: build global compound pool
chembl-curator build-pool --data-dir $DATA
5 Receptor Similarity

Pairwise receptor similarity via sequence identity (MMseqs2) and pocket RMSD (TM-align + pocket extraction). Used for decoy exclusion and train/test splitting.

Sequence Identity (sample)

query   target  seqid
P38646  P11021  0.5640
P38646  Q92598  0.3160
Q62645  O15399  0.9610
Q62645  Q14957  0.6250
Q62645  Q00961  0.6290
Q62645  Q00959  0.5870
Q62645  Q05586  0.2750
Q62645  P42264  0.3110
Q62645  P39086  0.2800
Q62645  Q13002  0.2790
Q62645  Q01812  0.3000
Q62645  P19493  0.3030
Q62645  P19492  0.3010
Q9UL51  O88703  0.9140
Q9UL51  Q9Y3Q4  0.8680

Pocket RMSD (sample)

target_a    target_b    tm_score             pocket_rmsd         n_matched
A0A0H2UPP7  A0A6L0XH39  0.23398245388296907  4.457746544815832   9
A0A0H2UPP7  A0A6L8P2U9  0.24220491042625336  5.120543242815301   3
A0A0H2UPP7  A0QNE0      0.21491255529453854  -1.0                0
A0A0H2UPP7  A0R607      0.1929788593426652   -1.0                0
A0A0H2UPP7  A2RI47      0.3008240842355274   -1.0                0
A0A0H2UPP7  A2RP81      0.27071351149842504  5.886550447296809   7
A0A0H2UPP7  A4D1P6      0.257588243635652    -1.0                0
A0A0H2UPP7  A5H660      0.25282100967882776  -1.0                0
A0A0H2UPP7  A5K1A2      0.22546256691889008  -1.0                0
A0A0H2UPP7  B2RXH2      0.2439346030214735   5.9403022852225105  6
A0A0H2UPP7  B4EB35      0.27991083621217     5.916345725506166   7
A0A0H2UPP7  D1MEN9      0.2744626204110205   -1.0                1
A0A0H2UPP7  D3W065      0.37759373361427473  4.702880004214089   8
A0A0H2UPP7  E5KIY2      0.2825562962047324   -1.0                0
A0A0H2UPP7  F1M391      0.36013480001944065  -1.0                0

# Stage 5: compute pairwise receptor similarity
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
6 Decoy Selection

Property-matched, chemically dissimilar decoys per active, using receptor similarity from Stage 5 to exclude compounds active against similar targets (seqid > 0.6 OR pocket RMSD < 3Å).
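
The receptor-aware exclusion rule can be sketched as follows. This is a hedged illustration, not the tool's code: the function names, target IDs (`T0` to `T4`) and compound IDs (`C1` to `C4`) are hypothetical, and two points are assumptions: a negative pocket RMSD (the -1.0 entries in the Stage 5 sample) is read as "not computed" and ignored, and actives of the query target itself are also excluded.

```python
from typing import Optional


def receptors_similar(seqid: Optional[float], pocket_rmsd: Optional[float],
                      seqid_cut: float = 0.6, rmsd_cut: float = 3.0) -> bool:
    """True if seqid > seqid_cut OR pocket RMSD < rmsd_cut (in Å)."""
    by_sequence = seqid is not None and seqid > seqid_cut
    # Assumption: a negative RMSD is a sentinel for "no pocket alignment".
    by_pocket = pocket_rmsd is not None and 0.0 <= pocket_rmsd < rmsd_cut
    return by_sequence or by_pocket


def allowed_decoys(candidates: dict, excluded_targets: set) -> list:
    """Keep compounds that are not active against any excluded target."""
    return [cid for cid, targets in candidates.items()
            if not (targets & excluded_targets)]


# Similarity of query target T0 to other targets: (seqid, pocket_rmsd).
similarity_to_t0 = {
    "T1": (0.96, None),   # similar by sequence
    "T2": (0.28, 2.1),    # similar by pocket
    "T3": (0.31, -1.0),   # pocket RMSD not computed, sequence dissimilar
    "T4": (0.25, 4.5),    # dissimilar on both measures
}
excluded = {"T0"} | {t for t, (s, r) in similarity_to_t0.items()
                     if receptors_similar(s, r)}

# Pool compounds mapped to the targets they are active against.
pool = {"C1": {"T1"}, "C2": {"T3"}, "C3": {"T4", "T2"}, "C4": {"T4"}}
print(sorted(excluded))
print(allowed_decoys(pool, excluded))
```

Here C1 and C3 are dropped because each is active against a target judged similar to T0, while C2 and C4 remain available as decoy candidates.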

Matching Windows

±50 Da MW
±2 cLogP
±50 Ų TPSA
±2 HBD / HBA
±1 Aromatic rings
< 0.3 Tanimoto to active
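
The matching windows above amount to a simple predicate over precomputed properties. The sketch below assumes the windows are inclusive and that the ±2 window applies to HBD and HBA separately; `Props`, `WINDOWS`, `is_property_matched` and the property values are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Props:
    """Hypothetical container for the Stage 4 pool properties."""
    mw: float
    clogp: float
    tpsa: float
    hbd: int
    hba: int
    arom_rings: int


# Maximum allowed absolute difference between active and decoy.
WINDOWS = {"mw": 50.0, "clogp": 2.0, "tpsa": 50.0,
           "hbd": 2, "hba": 2, "arom_rings": 1}
MAX_TANIMOTO = 0.3


def is_property_matched(active: Props, candidate: Props,
                        tanimoto_to_active: float) -> bool:
    """True if every property lies inside its window and the pair is dissimilar."""
    within = all(abs(getattr(active, name) - getattr(candidate, name)) <= width
                 for name, width in WINDOWS.items())
    return within and tanimoto_to_active < MAX_TANIMOTO


active = Props(mw=350.4, clogp=3.1, tpsa=75.0, hbd=2, hba=5, arom_rings=3)
close = Props(mw=380.2, clogp=4.0, tpsa=90.0, hbd=1, hba=6, arom_rings=2)
heavy = Props(mw=420.0, clogp=4.0, tpsa=90.0, hbd=1, hba=6, arom_rings=2)

print(is_property_matched(active, close, tanimoto_to_active=0.18))
print(is_property_matched(active, heavy, tanimoto_to_active=0.18))
print(is_property_matched(active, close, tanimoto_to_active=0.45))
```

Only the first call passes: the second candidate is about 70 Da heavier than the active, and the third is too similar to it.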

A0QNE0 Decoys (30 per active)

active_chembl_id  decoy_chembl_ids
CHEMBL36506       CHEMBL5771943, CHEMBL1814275, CHEMBL327404, CHEMBL6064676, CHEMBL3393486 ... +25 more
CHEMBL3941457     CHEMBL5441140, CHEMBL5558804, CHEMBL3937054, CHEMBL178288, CHEMBL1617447 ... +25 more
CHEMBL144307      CHEMBL2018926, CHEMBL159648, CHEMBL397473, CHEMBL341936, CHEMBL1797400 ... +25 more
CHEMBL3967142     CHEMBL1254419, CHEMBL5772051, CHEMBL1760949, CHEMBL3415163, CHEMBL285895 ... +25 more
CHEMBL3949206     CHEMBL84413, CHEMBL215704, CHEMBL328283, CHEMBL359834, CHEMBL467240 ... +25 more
CHEMBL3921744     CHEMBL3220124, CHEMBL328993, CHEMBL1951882, CHEMBL5879396, CHEMBL4159686 ... +25 more
CHEMBL3895213     CHEMBL2381363, CHEMBL410598, CHEMBL1800770, CHEMBL5748326, CHEMBL3950988 ... +25 more
CHEMBL3957660     CHEMBL563247, CHEMBL458289, CHEMBL395795, CHEMBL4473652, CHEMBL5897022 ... +25 more
CHEMBL3968705     CHEMBL5408526, CHEMBL3664335, CHEMBL3670601, CHEMBL208454, CHEMBL5399671 ... +25 more

# Stage 6: select property-matched, receptor-aware decoys
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
7 Train / Test Split

Sequence-identity clustering (MMseqs2) ensures no similar receptors leak between splits. Greedy assignment balances per-source ratios. Sampling weight = 1 / log2(cluster_size + 1).
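
The weighting formula and the idea of keeping whole clusters on one side of the split can be sketched as below. This is a simplified illustration under stated assumptions: `sampling_weight` and `greedy_split` are hypothetical names, the cluster contents are invented, and the per-source ratio balancing mentioned above is not reproduced. The source does not state which cluster the size refers to, so the sketch takes it as a plain input.

```python
import math


def sampling_weight(cluster_size: int) -> float:
    """Weight = 1 / log2(cluster_size + 1); a singleton cluster gets 1.0."""
    return 1.0 / math.log2(cluster_size + 1)


def greedy_split(clusters: dict, test_frac: float = 0.1):
    """Assign whole clusters to test (largest first) while they fit the quota."""
    total = sum(len(members) for members in clusters.values())
    quota = test_frac * total
    train, test = [], []
    by_size = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
    for _, members in by_size:
        if len(test) + len(members) <= quota:
            test.extend(members)   # cluster fits: all members go to test
        else:
            train.extend(members)  # otherwise the whole cluster stays in train
    return train, test


# Weights for cluster sizes 1, 4 and 7.
print([round(sampling_weight(n), 2) for n in (1, 4, 7)])

# Invented sequence clusters of receptor IDs.
clusters = {
    "c1": ["R1", "R2", "R3", "R4", "R5", "R6"],
    "c2": ["R7", "R8", "R9"],
    "c3": ["R10"],
}
train, test = greedy_split(clusters, test_frac=0.15)
print(train, test)
```

Cluster sizes of 1, 4 and 7 give weights of 1.00, 0.43 and 0.33, so members of large clusters are sampled less often. Because clusters are never divided, no receptor in the test list shares a cluster with one in the train list.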

Train Split (sample)

source   entry_id      compound      weight
biolip   10gs_VWW_A_1  -             0.19
biolip   11gs_EAA_A_1  -             0.19
biolip   16pk_BIS_A_1  -             0.26
pdbbind  3eml          -             0.17
chembl   A0A0H2UPP7    CHEMBL405346  0.17
chembl   A0A0H2UPP7    CHEMBL407216  0.17
chembl   A0A6L8P2U9    CHEMBL557281  0.17
chembl   A0QNE0        CHEMBL36506   0.19

Test Split (sample)

source   entry_id      compound       weight
biolip   1a5v_Y3_A_1   -              1.00
biolip   1a99_PUT_A_1  -              0.43
pdbbind  1b55          -              0.39
chembl   A2RI47        CHEMBL4289524  0.33
chembl   A2RI47        CHEMBL401772   0.33
chembl   O14936        CHEMBL291278   0.43
chembl   O14936        CHEMBL375163   0.43
chembl   O14936        CHEMBL50981    0.43

Target Summary (sample)

uniprot     split  n_actives  n_decoys
A0A0H2UPP7  train  2          60
A0A6L0XH39  train  1          30
A0A6L8P2U9  train  14         420
A0QNE0      train  9          270
A0R607      train  1          30
A2RI47      test   12         360
A2RP81      test   6          162
A4D1P6      train  3          90
A5H660      train  65         1950
A5K1A2      train  3          90
B2RXH2      train  50         1233
B4EB35      test   1          30
D1MEN9      train  1          30
D3W065      train  1          30
E5KIY2      train  4          90
F1M391      train  1          30
F1QCV2      train  11         330
G3FIN0      train  1          30
I6L8L7      train  1          30
I6WXK4      test   24         720

# Stage 7: train/test split by sequence identity clustering
chembl-curator split --data-dir $DATA --valid-frac 0.1
Quick Start

Run the full pipeline end-to-end.

# Full pipeline
DATA=curated_data_filtered

chembl-curator curate --download --output $DATA
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
chembl-curator cluster-actives --data-dir $DATA --workers 8
chembl-curator build-pool --data-dir $DATA
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
chembl-curator split --data-dir $DATA --valid-frac 0.1