ChEMBL-Q

A seven-stage pipeline for curating ChEMBL into ready-to-use ML screening datasets: compound filtering, receptor structure validation, active clustering, decoy selection, and train/test splitting by receptor similarity.

Targets: 1,543

Actives: 172,092

Decoys: 5,064,675

Train / Test targets: 1,388 / 155

Filter Actives → Filter Receptors → Cluster Actives → Decoy Pool → Receptor Sim → Select Decoys → Split

Example Targets: Receptor-Ligand Structures

PyMOL-rendered binding sites for two example targets in the curated dataset. Binding residues within 4 Å are shown as sticks, ligands as ball-and-stick, and polar contacts as dashes.

A0QNE0: DNA Gyrase B (PDB 6ZT3, 1.56 Å)

Figures: full structure with ANP ligand; ANP binding site detail.

A2RI47: Thiamine Transporter ThiT (PDB 4POP, 2.20 Å)

Figures: full structure with 2VY ligand; 2VY binding site detail.

1 Filter Active Compounds

Extract and filter ligand-target pairs from ChEMBL. Only high-confidence, single-receptor binding assay data is retained.

Default Filters

SQL-level filters on ChEMBL activities, assays, and target_dictionary tables, plus RDKit compound validation.

Example: A0QNE0 Actives (actives.tsv)

chembl_id	pchembl	smiles
CHEMBL3941457	5.5100	O=C(Nc1ccc(F)cc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1
CHEMBL3949206	5.6000	Fc1ccc(NC(=S)N2CCN(c3ccnc4cc(Cl)ccc34)CC2)cc1
CHEMBL144307	5.2600	S=C(NCc1ccccc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1
CHEMBL3968705	5.1200	CC(=O)c1ccc(NC(=O)CN2CCN(c3ccnc4cc(Cl)ccc34)CC2...
CHEMBL3967142	5.0800	Clc1ccc2c(N3CCN(Cc4c[nH]c5ccccc45)CC3)ccnc2c1
CHEMBL3957660	5.0700	O=C(c1cc2ccccc2[nH]1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1
CHEMBL3921744	5.0200	O=[N+]([O-])c1cnc(N2CCN(c3ccnc4cc(Cl)ccc34)CC2)s1
CHEMBL3895213	5.0400	O=C(CN1CCN(c2ccnc3cc(Cl)ccc23)CC1)Nc1ccccn1
CHEMBL36506	7.3400	CO[C@@H]1[C@@H](OC(N)=O)[C@@H](O)[C@H](Oc2ccc3c...

# Stage 1: curate compounds from ChEMBL
chembl-curator curate --download --output $DATA
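
The SQL-level filtering can be illustrated with a small in-memory SQLite example. This is only a sketch: the table and column names exist in the ChEMBL schema, but the specific filter values (binding assays with `assay_type = 'B'`, `confidence_score = 9`, `target_type = 'SINGLE PROTEIN'`, a pChEMBL floor of 5.0) and the choice to keep the highest pChEMBL per compound-target pair are assumptions, not the pipeline's documented defaults.

```python
import sqlite3

# Toy stand-in for three ChEMBL tables; the real schema has many more columns.
SCHEMA = """
CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, target_type TEXT);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER,
                     assay_type TEXT, confidence_score INTEGER);
CREATE TABLE activities (activity_id INTEGER PRIMARY KEY, assay_id INTEGER,
                         molregno INTEGER, pchembl_value REAL);
"""

# Assumed filter values (not taken from the pipeline's defaults).
QUERY = """
SELECT act.molregno, a.tid, MAX(act.pchembl_value) AS pchembl
FROM activities AS act
JOIN assays AS a ON a.assay_id = act.assay_id
JOIN target_dictionary AS t ON t.tid = a.tid
WHERE a.assay_type = 'B'                 -- binding assays only
  AND a.confidence_score = 9             -- direct single protein target
  AND t.target_type = 'SINGLE PROTEIN'
  AND act.pchembl_value >= :min_pchembl
GROUP BY act.molregno, a.tid
ORDER BY a.tid, act.molregno
"""


def filter_actives(conn, min_pchembl=5.0):
    """Return (molregno, tid, best pChEMBL) rows that pass all filters."""
    return conn.execute(QUERY, {"min_pchembl": min_pchembl}).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO target_dictionary VALUES (?, ?)",
                 [(1, "SINGLE PROTEIN"), (2, "PROTEIN COMPLEX")])
conn.executemany("INSERT INTO assays VALUES (?, ?, ?, ?)",
                 [(10, 1, "B", 9), (11, 1, "F", 9), (12, 2, "B", 7)])
conn.executemany("INSERT INTO activities VALUES (?, ?, ?, ?)", [
    (100, 10, 501, 7.34),  # kept
    (101, 10, 502, 4.20),  # dropped: pChEMBL below the floor
    (102, 11, 503, 6.00),  # dropped: functional assay
    (103, 12, 504, 8.00),  # dropped: protein complex, low confidence
    (104, 10, 501, 6.90),  # same compound and target: only the max is reported
])

rows = filter_actives(conn)
print(rows)
```

Only compound 501 against target 1 survives in this toy database. RDKit validation of the resulting SMILES would follow as a separate step and is not shown here.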

2 Filter Receptors

Fetch PDB structures, download AlphaFold models, align via TM-align, and keep only targets with a single, validated binding site.

Steps

Query UniProt for PDB entries
Download PDB structures + AlphaFold model
Detect ligand-bound structures (contact < 4 Å)
Align to AlphaFold via TM-align
Cluster pocket centroids (< 10 Å)
Keep single-binding-site targets only

Pocket Info: A0QNE0

PDB_ID	Chain	Aligned_File	Ligand	Center_X	Center_Y	Center_Z
4B6C	A	4b6c_A.pdb	B5U	21.581	12.742	-2.526
4BAE	A	4bae_A.pdb	RWX	21.351	14.055	-1.085
4BAE	B	4bae_B.pdb	RWX	21.352	14.125	-1.259
4BAE	C	4bae_C.pdb	RWX	21.401	14.103	-1.084
4BAE	D	4bae_D.pdb	RWX	21.271	14.168	-1.215
6ZT3	A	6zt3_A.pdb	ANP	18.445	9.574	-2.669

Pocket Info: A2RI47

PDB_ID	Chain	Aligned_File	Ligand	Center_X	Center_Y	Center_Z
4POP	A	4pop_A.pdb	2VY	8.514	6.315	-7.388
4POV	A	4pov_A.pdb	2VY	8.483	6.307	-7.404
4POV	B	4pov_B.pdb	2VY	8.449	6.347	-7.370

# Stage 2: validate receptor structures and binding sites
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
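
The single-binding-site check can be sketched with the pocket centers listed above for A0QNE0. The linkage rule is an assumption: this sketch uses single-linkage clustering (any two centroids closer than 10 Å end up in the same cluster), which may differ from what the tool does internally. The function names are hypothetical.

```python
import math


def cluster_centroids(centers, cutoff=10.0):
    """Single-linkage clustering: centroids closer than `cutoff` Å share a cluster."""
    parent = list(range(len(centers)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(centers)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def has_single_site(centers, cutoff=10.0):
    """True when every pocket centroid falls into one cluster."""
    return len(cluster_centroids(centers, cutoff)) == 1


# Pocket centers for A0QNE0, copied from the table above (aligned frame).
a0qne0 = [
    (21.581, 12.742, -2.526),  # 4B6C A, B5U
    (21.351, 14.055, -1.085),  # 4BAE A, RWX
    (21.352, 14.125, -1.259),  # 4BAE B, RWX
    (21.401, 14.103, -1.084),  # 4BAE C, RWX
    (21.271, 14.168, -1.215),  # 4BAE D, RWX
    (18.445, 9.574, -2.669),   # 6ZT3 A, ANP
]

print(has_single_site(a0qne0))

# A made-up distant centroid (not from the dataset) creates a second site.
print(has_single_site(a0qne0 + [(50.0, 50.0, 50.0)]))
```

All six A0QNE0 centers lie within a few ångströms of one another, so the target passes the check; the target with the added hypothetical centroid would be rejected.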

3 Cluster Actives

Butina clustering of actives per target (Tanimoto similarity ≥ 0.7, which corresponds to a distance threshold of 0.3, hence --dist-thresh 0.3). The highest-pChEMBL compound is picked as the representative of each cluster.

A0QNE0: 15 actives → 9 clusters

chembl_id	pchembl	smiles	cluster_size
CHEMBL3941457	5.5100	O=C(Nc1ccc(F)cc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1	6
CHEMBL3949206	5.6000	Fc1ccc(NC(=S)N2CCN(c3ccnc4cc(Cl)ccc34)CC2)cc1	2
CHEMBL144307	5.2600	S=C(NCc1ccccc1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1	1
CHEMBL3968705	5.1200	CC(=O)c1ccc(NC(=O)CN2CCN(c3ccnc4cc(Cl)ccc34)CC2...	1
CHEMBL3967142	5.0800	Clc1ccc2c(N3CCN(Cc4c[nH]c5ccccc45)CC3)ccnc2c1	1
CHEMBL3957660	5.0700	O=C(c1cc2ccccc2[nH]1)N1CCN(c2ccnc3cc(Cl)ccc23)CC1	1
CHEMBL3921744	5.0200	O=[N+]([O-])c1cnc(N2CCN(c3ccnc4cc(Cl)ccc34)CC2)s1	1
CHEMBL3895213	5.0400	O=C(CN1CCN(c2ccnc3cc(Cl)ccc23)CC1)Nc1ccccn1	1
CHEMBL36506	7.3400	CO[C@@H]1[C@@H](OC(N)=O)[C@@H](O)[C@H](Oc2ccc3c...	1

A2RI47: 15 actives → 12 clusters

chembl_id	pchembl	smiles	cluster_size
CHEMBL4289524	6.2800	Cc1ncc(Cc2cccc(CO)c2)c(N)n1	3
CHEMBL401772	8.3700	Cc1ncc(Cc2csc(CCO)c2C)c(N)n1	2
CHEMBL4278850	6.5800	Cc1ncc(Cc2cccc(CCO)c2)c(N)n1	1
CHEMBL4289112	7.7000	Cc1ncc(Cc2csc(CCOC(=O)c3c[nH]c4ccccc34)c2C)c(N)n1	1
CHEMBL4287450	7.0100	Cc1ncc(Cc2csc(CNCc3ccccn3)c2C)c(N)n1	1
CHEMBL4285718	6.6800	Cc1ncc(Cc2csc(COP(=O)(O)OP(=O)(O)O)c2C)c(N)n1	1
CHEMBL4284640	6.7800	Cc1ncc(Cc2csc(CNCC(c3ccccc3)c3ccccc3)c2C)c(N)n1	1
CHEMBL4281235	7.3900	Cc1ncc(Cc2csc(COCc3ccccc3)c2C)c(N)n1	1
CHEMBL4279528	6.7400	Cc1ncc(Cc2csc(CCS[C@@H]3O[C@H](CO)[C@@H](O)[C@H...	1
CHEMBL1547	9.9100	Cc1ncc(C[n+]2csc(CCO)c2C)c(N)n1	1
CHEMBL1236376	8.8000	Cc1ncc(C[n+]2csc(CCOP(=O)(O)OP(=O)(O)O)c2C)c(N)n1	1
CHEMBL1229798	9.7400	Cc1ncc(C[n+]2cccc(CCO)c2C)c(N)n1	1

# Stage 3: cluster actives per target (Butina, Tanimoto ≥ 0.7)
chembl-curator cluster-actives --data-dir $DATA --dist-thresh 0.3 --workers 8
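
A minimal pure-Python sketch of Butina clustering followed by representative selection. Fingerprints are modeled as sets of on-bit indices rather than RDKit Morgan fingerprints, and the compound names, pChEMBL values, and bit sets are made up for illustration. In practice RDKit's `rdkit.ML.Cluster.Butina.ClusterData` would be the more likely choice.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fps, dist_thresh=0.3):
    """Butina clustering: returns clusters as lists of indices, centroid first."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if 1.0 - tanimoto(fps[i], fps[j]) <= dist_thresh:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Visit compounds with the most neighbors first; each becomes a centroid
    # and claims its still-unassigned neighbors.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical compounds: (name, pChEMBL, fingerprint on-bits).
compounds = [
    ("cpd_a", 5.5, {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}),
    ("cpd_b", 6.1, {1, 2, 3, 4, 5, 6, 7, 8, 9, 11}),
    ("cpd_c", 5.0, {1, 2, 3, 4, 5, 6, 7, 8, 10, 12}),
    ("cpd_d", 7.3, {20, 21, 22, 23}),
    ("cpd_e", 5.2, {20, 21, 22, 24}),
]

clusters = butina([fp for _, _, fp in compounds], dist_thresh=0.3)

# Representative per cluster: the member with the highest pChEMBL.
representatives = []
for members in clusters:
    best = max(members, key=lambda i: compounds[i][1])
    representatives.append((compounds[best][0], compounds[best][1], len(members)))

print(representatives)
```

Note that the representative (`cpd_b`) need not be the Butina centroid (`cpd_a`): the centroid defines cluster membership, while potency decides which member is kept.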

4 Build Decoy Pool

Build a global pool from all clustered actives: deduplicate by ChEMBL ID, then compute molecular descriptors and Morgan fingerprints for property matching.

Pool Properties Per Compound

# Stage 4: build global compound pool
chembl-curator build-pool --data-dir $DATA
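
The deduplication step might look like the following sketch. The function name, the record layout, and the idea of tracking which targets each compound is active against are assumptions made for illustration; `TARGET_X` is a hypothetical second target, not a real entry. Descriptor and fingerprint calculation (RDKit) is left out to keep the example dependency-free.

```python
def build_pool(actives_by_target):
    """Merge per-target actives into one pool keyed by ChEMBL ID."""
    pool = {}
    for target, rows in actives_by_target.items():
        for chembl_id, smiles in rows:
            # First occurrence creates the entry; later ones only add targets.
            entry = pool.setdefault(chembl_id, {"smiles": smiles, "targets": set()})
            entry["targets"].add(target)
    return pool


actives_by_target = {
    "A2RI47": [
        ("CHEMBL1547", "Cc1ncc(C[n+]2csc(CCO)c2C)c(N)n1"),
        ("CHEMBL401772", "Cc1ncc(Cc2csc(CCO)c2C)c(N)n1"),
    ],
    # Hypothetical target sharing one compound with A2RI47.
    "TARGET_X": [
        ("CHEMBL1547", "Cc1ncc(C[n+]2csc(CCO)c2C)c(N)n1"),
    ],
}

pool = build_pool(actives_by_target)
print(len(pool), sorted(pool["CHEMBL1547"]["targets"]))
```

Keeping the set of targets per pooled compound is one way to support the receptor-aware exclusion in Stage 6.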

5 Receptor Similarity

Pairwise receptor similarity via sequence identity (MMseqs2) and pocket RMSD (TM-align + pocket extraction). Used for decoy exclusion and train/test splitting.

Sequence Identity (sample)

query	target	seqid
P38646	P11021	0.5640
P38646	Q92598	0.3160
Q62645	O15399	0.9610
Q62645	Q14957	0.6250
Q62645	Q00961	0.6290
Q62645	Q00959	0.5870
Q62645	Q05586	0.2750
Q62645	P42264	0.3110
Q62645	P39086	0.2800
Q62645	Q13002	0.2790
Q62645	Q01812	0.3000
Q62645	P19493	0.3030
Q62645	P19492	0.3010
Q9UL51	O88703	0.9140
Q9UL51	Q9Y3Q4	0.8680

Pocket RMSD (sample)

target_a	target_b	tm_score	pocket_rmsd	n_matched
A0A0H2UPP7	A0A6L0XH39	0.23398245388296907	4.457746544815832	9
A0A0H2UPP7	A0A6L8P2U9	0.24220491042625336	5.120543242815301	3
A0A0H2UPP7	A0QNE0	0.21491255529453854	-1.0	0
A0A0H2UPP7	A0R607	0.1929788593426652	-1.0	0
A0A0H2UPP7	A2RI47	0.3008240842355274	-1.0	0
A0A0H2UPP7	A2RP81	0.27071351149842504	5.886550447296809	7
A0A0H2UPP7	A4D1P6	0.257588243635652	-1.0	0
A0A0H2UPP7	A5H660	0.25282100967882776	-1.0	0
A0A0H2UPP7	A5K1A2	0.22546256691889008	-1.0	0
A0A0H2UPP7	B2RXH2	0.2439346030214735	5.9403022852225105	6
A0A0H2UPP7	B4EB35	0.27991083621217	5.916345725506166	7
A0A0H2UPP7	D1MEN9	0.2744626204110205	-1.0	1
A0A0H2UPP7	D3W065	0.37759373361427473	4.702880004214089	8
A0A0H2UPP7	E5KIY2	0.2825562962047324	-1.0	0
A0A0H2UPP7	F1M391	0.36013480001944065	-1.0	0

# Stage 5: compute pairwise receptor similarity
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
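
The two similarity tables can be combined into a set of "similar" target pairs using the thresholds quoted in Stage 6 (seqid > 0.6 OR pocket RMSD < 3 Å). This is a sketch under two assumptions: the function name is hypothetical, and a negative `pocket_rmsd` such as -1.0 is treated as "no pocket RMSD available". Values below are copied from the sample rows and rounded to three decimals.

```python
def similar_pairs(seq_rows, pocket_rows, seqid_cut=0.6, rmsd_cut=3.0):
    """Return unordered target pairs flagged as similar by either criterion."""
    similar = set()
    for query, target, seqid in seq_rows:
        if seqid > seqid_cut:
            similar.add(frozenset((query, target)))
    for target_a, target_b, _tm_score, pocket_rmsd, _n_matched in pocket_rows:
        # Assumption: negative RMSD is a sentinel for "not computed".
        if 0.0 <= pocket_rmsd < rmsd_cut:
            similar.add(frozenset((target_a, target_b)))
    return similar


seq_rows = [
    ("P38646", "P11021", 0.564),
    ("Q62645", "O15399", 0.961),
    ("Q62645", "Q00959", 0.587),
    ("Q9UL51", "Q9Y3Q4", 0.868),
]
pocket_rows = [
    ("A0A0H2UPP7", "A0A6L0XH39", 0.234, 4.458, 9),
    ("A0A0H2UPP7", "A0QNE0", 0.215, -1.0, 0),
]

pairs = similar_pairs(seq_rows, pocket_rows)
print(sorted(tuple(sorted(p)) for p in pairs))
```

With these sample rows only the two high-identity pairs are flagged; none of the sampled pocket RMSD values falls below 3 Å.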

6 Decoy Selection

Select property-matched, chemically dissimilar decoys for each active. Receptor similarity from Stage 5 is used to exclude compounds that are active against similar targets (seqid > 0.6 OR pocket RMSD < 3 Å).

Matching Windows

A0QNE0 Decoys (30 per active)

active_chembl_id	decoy_chembl_ids
CHEMBL36506	CHEMBL5771943, CHEMBL1814275, CHEMBL327404, CHEMBL6064676, CHEMBL3393486 ... +25 more
CHEMBL3941457	CHEMBL5441140, CHEMBL5558804, CHEMBL3937054, CHEMBL178288, CHEMBL1617447 ... +25 more
CHEMBL144307	CHEMBL2018926, CHEMBL159648, CHEMBL397473, CHEMBL341936, CHEMBL1797400 ... +25 more
CHEMBL3967142	CHEMBL1254419, CHEMBL5772051, CHEMBL1760949, CHEMBL3415163, CHEMBL285895 ... +25 more
CHEMBL3949206	CHEMBL84413, CHEMBL215704, CHEMBL328283, CHEMBL359834, CHEMBL467240 ... +25 more
CHEMBL3921744	CHEMBL3220124, CHEMBL328993, CHEMBL1951882, CHEMBL5879396, CHEMBL4159686 ... +25 more
CHEMBL3895213	CHEMBL2381363, CHEMBL410598, CHEMBL1800770, CHEMBL5748326, CHEMBL3950988 ... +25 more
CHEMBL3957660	CHEMBL563247, CHEMBL458289, CHEMBL395795, CHEMBL4473652, CHEMBL5897022 ... +25 more
CHEMBL3968705	CHEMBL5408526, CHEMBL3664335, CHEMBL3670601, CHEMBL208454, CHEMBL5399671 ... +25 more

# Stage 6: select property-matched, receptor-aware decoys
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
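
A sketch of receptor-aware decoy selection for one active. Everything numeric here is an assumption, since the actual matching windows are not listed: the molecular weight window (±25), the logP window (±1.0), and the maximum Tanimoto similarity to the active (0.35) are placeholders, and all compounds and targets are made up. Only the cap of 30 decoys per active comes from the command above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    chembl_id: str
    mol_weight: float
    logp: float
    fp: frozenset       # fingerprint on-bits
    targets: frozenset  # targets the compound is active against


def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_decoys(active, target, pool, similar, max_decoys=30,
                  mw_window=25.0, logp_window=1.0, max_sim=0.35):
    """Pick up to `max_decoys` pool compounds matched to `active`."""
    # The target itself plus every target flagged as similar to it.
    excluded = {target} | {t for pair in similar if target in pair for t in pair}
    decoys = []
    for cand in pool:
        if cand.chembl_id == active.chembl_id:
            continue
        if cand.targets & excluded:
            continue  # active on this or a similar receptor
        if abs(cand.mol_weight - active.mol_weight) > mw_window:
            continue  # outside the property window
        if abs(cand.logp - active.logp) > logp_window:
            continue
        if tanimoto(cand.fp, active.fp) >= max_sim:
            continue  # too similar chemically to be a safe decoy
        decoys.append(cand.chembl_id)
        if len(decoys) == max_decoys:
            break
    return decoys


active = Compound("ACTIVE", 350.0, 3.0, frozenset({1, 2, 3, 4}), frozenset({"T1"}))
similar = {frozenset({"T1", "T2"})}
pool = [
    Compound("D_ok", 360.0, 3.4, frozenset({10, 11, 12, 13}), frozenset({"T9"})),
    Compound("D_similar_target", 355.0, 3.1, frozenset({20, 21}), frozenset({"T2"})),
    Compound("D_heavy", 450.0, 3.0, frozenset({30, 31}), frozenset({"T9"})),
    Compound("D_analog", 352.0, 3.0, frozenset({1, 2, 3, 5}), frozenset({"T9"})),
    Compound("D_ok2", 340.0, 2.5, frozenset({40, 41}), frozenset({"T8"})),
]

decoys = select_decoys(active, "T1", pool, similar)
print(decoys)
```

The sketch takes the first candidates that pass every filter; a real implementation could instead rank candidates by how closely their properties match the active.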

7 Train / Test Split

Sequence-identity clustering (MMseqs2) ensures no similar receptors leak between splits. Greedy assignment balances per-source ratios. Sampling weight = 1 / log₂(cluster_size + 1).

Train Split (sample)

source	entry_id	compound	weight
biolip	10gs_VWW_A_1	-	0.19
biolip	11gs_EAA_A_1	-	0.19
biolip	16pk_BIS_A_1	-	0.26
pdbbind	3eml	-	0.17
chembl	A0A0H2UPP7	CHEMBL405346	0.17
chembl	A0A0H2UPP7	CHEMBL407216	0.17
chembl	A0A6L8P2U9	CHEMBL557281	0.17
chembl	A0QNE0	CHEMBL36506	0.19

Test Split (sample)

source	entry_id	compound	weight
biolip	1a5v_Y3_A_1	-	1.00
biolip	1a99_PUT_A_1	-	0.43
pdbbind	1b55	-	0.39
chembl	A2RI47	CHEMBL4289524	0.33
chembl	A2RI47	CHEMBL401772	0.33
chembl	O14936	CHEMBL291278	0.43
chembl	O14936	CHEMBL375163	0.43
chembl	O14936	CHEMBL50981	0.43

Target Summary (sample)

uniprot	split	n_actives	n_decoys
A0A0H2UPP7	train	2	60
A0A6L0XH39	train	1	30
A0A6L8P2U9	train	14	420
A0QNE0	train	9	270
A0R607	train	1	30
A2RI47	test	12	360
A2RP81	test	6	162
A4D1P6	train	3	90
A5H660	train	65	1950
A5K1A2	train	3	90
B2RXH2	train	50	1233
B4EB35	test	1	30
D1MEN9	train	1	30
D3W065	train	1	30
E5KIY2	train	4	90
F1M391	train	1	30
F1QCV2	train	11	330
G3FIN0	train	1	30
I6L8L7	train	1	30
I6WXK4	test	24	720

# Stage 7: train/test split by sequence identity clustering
chembl-curator split --data-dir $DATA --valid-frac 0.1
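
The sampling weight formula and a cluster-level greedy split can be sketched as follows. The weight function follows the formula stated above. The split logic is an assumption: it places whole clusters, largest first, into the test set only while every source stays within its target fraction, which is one simple way to keep similar receptors on the same side. Entry names `T1` to `T10` are hypothetical.

```python
import math
from collections import Counter


def sampling_weight(cluster_size):
    """Weight = 1 / log2(cluster_size + 1), as given in the text."""
    return 1.0 / math.log2(cluster_size + 1)


def greedy_split(clusters, test_frac=0.1):
    """Assign whole clusters of (source, entry_id) pairs to train or test."""
    totals = Counter(source for cluster in clusters for source, _ in cluster)
    in_test = Counter()
    assignment = {}
    for cluster in sorted(clusters, key=len, reverse=True):
        counts = Counter(source for source, _ in cluster)
        # A cluster goes to test only if no source would exceed its budget.
        fits = all(in_test[s] + n <= test_frac * totals[s]
                   for s, n in counts.items())
        if fits:
            in_test.update(counts)
        for _, entry_id in cluster:
            assignment[entry_id] = "test" if fits else "train"
    return assignment


weights = [round(sampling_weight(n), 2) for n in (1, 4, 7)]
print(weights)

clusters = [
    [("chembl", "T1"), ("chembl", "T2"), ("chembl", "T3"), ("chembl", "T4")],
    [("chembl", "T5"), ("chembl", "T6"), ("chembl", "T7")],
    [("chembl", "T8")],
    [("chembl", "T9")],
    [("chembl", "T10")],
]
split = greedy_split(clusters, test_frac=0.2)
print(sorted(e for e, s in split.items() if s == "test"))
```

Cluster sizes of 1, 4, and 7 give weights of 1.00, 0.43, and 0.33, which match values appearing in the sample splits above.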

Quick Start

Run the full pipeline end-to-end.

# Full pipeline
DATA=curated_data_filtered

chembl-curator curate --download --output $DATA
chembl-curator filter-proteins --curated-dir $DATA --n-processes 8
chembl-curator cluster-actives --data-dir $DATA --dist-thresh 0.3 --workers 8
chembl-curator build-pool --data-dir $DATA
chembl-curator receptor-sim --data-dir $DATA --mode both --workers 8
chembl-curator select-decoys --data-dir $DATA --max-decoys 30
chembl-curator split --data-dir $DATA --valid-frac 0.1