Open Polymers – Predicting Polymer Properties with ML

Posted on June 20, 2025  |  Machine Learning  |  Cheminformatics

PyTorch XGBoost RDKit SMILES Cheminformatics TensorBoard

Open Polymers is a machine learning project for predicting the physical and thermodynamic properties of polymers directly from their SMILES string representations. It combines classical molecular descriptor engineering (via RDKit) with both deep learning (PyTorch) and gradient-boosted trees (XGBoost), ultimately finishing 656th out of 2,400+ teams after an XGBoost-driven jump from roughly 1,200th.

🏆 Rank 656 / 2,400+
Final Leaderboard Position
XGBoost jump: ~1,200th → 656th on the private leaderboard

1 — Project Structure

open-polymers/
  ├── data.ipynb    # EDA & visualization notebook
  ├── train_v1/
  │   ├── train.py    # Neural network training script
  │   └── infer.py    # Inference on new SMILES inputs
  ├── Data creation/
  │   └── # Imputation models for FFV, Tg, Density, Tc, Rg
  ├── DATA/
  │   └── # All versioned datasets (filled with different algos)
  ├── pca_descriptors.png    # PCA of molecular descriptors
  └── runs/       # TensorBoard logs per training run

2 — Dataset

5 — Target properties
SMILES — Primary input representation
RDKit — Descriptor engine
XGBoost — Best-performing model

Each row in the dataset represents a unique polymer described by its SMILES string, with the following target properties:

| Column  | Property                     | Unit / Notes                          |
| ------- | ---------------------------- | ------------------------------------- |
| id      | Unique polymer identifier    |                                       |
| SMILES  | Molecular structure string   | RDKit-parseable representation        |
| Tg      | Glass transition temperature | °C; material processability           |
| FFV     | Fractional free volume       | dimensionless; gas permeability proxy |
| Tc      | Decomposition temperature    | °C; thermal stability                 |
| Density | Polymer density              | g/cm³                                 |
| Rg      | Radius of gyration           | Å; chain conformation measure         |
📂 The DATA/ folder contains multiple dataset versions, each filled with a different missing-value imputation algorithm (KNN, regression, iterative, etc.) to handle the sparse property measurements common in polymer datasets.
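A quick way to see why multiple imputed versions exist is to profile how sparse each target actually is. A minimal sketch on a toy frame (the column names are the dataset's; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sparse label structure (illustrative values only)
df = pd.DataFrame({
    "Tg":      [150.0, np.nan, 210.0, np.nan],
    "FFV":     [0.36, 0.41, np.nan, np.nan],
    "Tc":      [np.nan, 320.0, np.nan, 290.0],
    "Density": [1.10, np.nan, 1.25, 1.05],
    "Rg":      [np.nan, np.nan, 18.2, 21.7],
})

# Fraction of missing labels per target -- the sparser a column,
# the more the downstream model depends on the imputation quality
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)
```

Profiling like this guides which imputation algorithm to trust per target before committing to a dataset version.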

3 — Handling Missing Values

Polymer datasets are notoriously sparse — labs typically measure only a subset of properties for any given compound. The Data creation module trains separate imputation models for each of the five targets:

Imputation Strategy
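The repository's per-target imputation models aren't reproduced here, but one of the strategies mentioned earlier (iterative, regression-based imputation) can be sketched with scikit-learn. The numbers below are illustrative, not real measurements:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix of the five targets with missing entries (illustrative values)
X = np.array([
    [150.0, 0.36, np.nan, 1.10, np.nan],
    [np.nan, 0.41, 320.0, np.nan, 19.5],
    [210.0, np.nan, np.nan, 1.25, 18.2],
    [180.0, 0.38, 290.0, 1.05, 21.7],
])

# Each missing value is modelled as a regression on the observed columns,
# iterating until the fills stabilise
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
assert not np.isnan(X_filled).any()
```

Fitting one such imputer per target (or per dataset version, as in DATA/) lets each model train on a densified label matrix.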

4 — Feature Engineering (SMILES → Descriptors)

Raw SMILES strings are converted into numerical feature vectors using RDKit: 208 physicochemical descriptors plus a 1,024-bit Morgan fingerprint (radius=2), i.e. 1,232 raw features, trimmed to 1,224 after dropping NaN-heavy descriptors.

| Descriptor Family    | Count | Examples                                     |
| -------------------- | ----- | -------------------------------------------- |
| Constitutional       | ~30   | MW, HeavyAtomCount, NumHeteroatoms, RingCount |
| Topological          | ~40   | BertzCT, Chi0, Chi1, Kappa1-3, HallKierAlpha |
| Electronic / Charge  | ~20   | MaxPartialCharge, MinAbsPartialCharge, TPSA  |
| Lipophilicity        | ~10   | MolLogP, MolMR, LabuteASA                    |
| Fragment counts      | ~80   | fr_amide, fr_ArN, fr_ether, fr_C_O, …        |
| VSA bins (MOE-style) | ~30   | PEOE_VSA1–14, SMR_VSA1–10, SlogP_VSA1–12     |
| Morgan Fingerprint   | 1,024 | Circular substructure bits (radius=2)        |

from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
import pandas as pd, numpy as np

def smiles_to_descriptors(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {}
    # 208 physicochemical descriptors
    desc = {name: fn(mol) for name, fn in Descriptors.descList}
    # 1024-bit Morgan fingerprint (radius=2)
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    desc.update({f"fp_{i}": int(b) for i, b in enumerate(fp)})
    return desc

df_features = pd.DataFrame([smiles_to_descriptors(s) for s in df["SMILES"]])
# Drop descriptors with >50% NaN (usually invalid ring/charge descriptors)
df_features = df_features.loc[:, df_features.isna().mean() < 0.5]
df_features = df_features.fillna(df_features.median())

A PCA was run on the full descriptor space to visualise the chemical diversity of the dataset:

📊 pca_descriptors.png — clusters in 2D PCA space reveal structurally distinct polymer families (e.g. polycarbonates, polyimides, siloxanes), motivating ensemble models over a single neural network.
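As a sketch of how such a projection is produced (a random matrix stands in here for the real descriptor frame, so it runs standalone):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for the descriptor matrix

# Descriptors live on wildly different scales, so standardise before PCA
X_std = StandardScaler().fit_transform(X)
coords = PCA(n_components=2).fit_transform(X_std)
print(coords.shape)              # one 2-D point per polymer: (200, 2)
```

Colouring `coords` by a known family label (or by a clustering of the fingerprints) is what surfaces the structurally distinct groups in pca_descriptors.png.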

5 — Preprocessing & Scaling

Before feeding descriptors into either model, a careful preprocessing pipeline is applied. The neural network is especially sensitive to feature scale, while XGBoost is invariant to monotonic transforms but benefits from variance-based feature selection.

from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Step 1 – drop (near-)constant features (variance below 0.01)
selector = VarianceThreshold(threshold=0.01)
X_sel = selector.fit_transform(df_features.values)

# Step 2 – robust scaling (median/IQR — insensitive to outliers)
#           applied only for the neural network path
scaler = RobustScaler()
X_nn  = scaler.fit_transform(X_sel)   # ← used by PolymerNet
X_xgb = X_sel                          # ← raw for XGBoost

# Step 3 – per-target label scaling (avoids the NN chasing large-scale targets)
from sklearn.preprocessing import StandardScaler
target_scalers = {}
for col in ["Tg", "FFV", "Tc", "Density", "Rg"]:
    ts = StandardScaler()
    mask = ~df[col].isna()
    df.loc[mask, col + "_scaled"] = ts.fit_transform(
        df.loc[mask, [col]]
    ).ravel()
    target_scalers[col] = ts   # saved for inverse-transform at inference

print(f"Feature dims after selection: {X_sel.shape[1]} / {df_features.shape[1]}")
# >> Feature dims after selection: 987 / 1224

6 — Neural Network Model (PyTorch)

Architecture

A fully-connected feed-forward network that takes the RDKit descriptor vector as input and predicts all five targets simultaneously (multi-output regression). Batch normalisation and dropout are used to prevent overfitting on the small dataset.

import torch
import torch.nn as nn

class PolymerNet(nn.Module):
    def __init__(self, input_dim: int, output_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(256, 128),
            nn.ReLU(),

            nn.Linear(128, output_dim),
        )

    def forward(self, x):
        return self.net(x)


# Training setup
model = PolymerNet(input_dim=X_train.shape[1]).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
criterion = nn.MSELoss()

# TensorBoard logging
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir="runs/polymer_nn_v1")

for epoch in range(300):  # full-batch pass over the scaled training tensors
    model.train()
    pred = model(X_train_t)
    loss = criterion(pred, y_train_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss)
    writer.add_scalar("Loss/train", loss.item(), epoch)
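One caveat worth flagging for the loop above: with sparse labels, a plain nn.MSELoss would propagate NaNs from unmeasured targets into the gradients. A masked variant (a sketch, not necessarily the repository's exact loss) averages the error over observed entries only:

```python
import torch

def masked_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """MSE computed over observed (non-NaN) targets only."""
    mask = ~torch.isnan(target)
    diff = pred[mask] - target[mask]   # select observed cells, skip NaNs
    return (diff ** 2).mean()

pred   = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.5, float("nan")], [3.0, 5.0]])
print(masked_mse(pred, target).item())  # mean of (0.25, 0.0, 1.0) ≈ 0.4167
```

Swapping `criterion` for a loss of this shape lets the multi-output network train on every row, however incomplete its labels.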

Monitor training with TensorBoard:

tensorboard --logdir runs

7 — XGBoost (Best-Performing Model)

After the neural network plateaued, XGBoost regressors were trained per-target using the same RDKit descriptor features (987-dim after variance filtering). This single switch drove the biggest leaderboard jump of the project.

Hyperparameter Tuning with Optuna

Each target's XGBoost was tuned independently using Optuna with 100 trials of TPE (Tree-structured Parzen Estimator) search, optimising 5-fold CV RMSE on the observed rows for that target.

import optuna
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

def objective(trial, X, y):
    # Note: Optuna stores parameters under the names passed to suggest_*,
    # and study.best_params is later splatted into XGBRegressor — so the
    # suggest names must match XGBoost's parameter names exactly.
    params = {
        "n_estimators"     : trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate"    : trial.suggest_float("learning_rate", 0.005, 0.3, log=True),
        "max_depth"        : trial.suggest_int("max_depth", 3, 10),
        "subsample"        : trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree" : trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.4, 1.0),
        "reg_alpha"        : trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),
        "reg_lambda"       : trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),
        "min_child_weight" : trial.suggest_int("min_child_weight", 1, 20),
        "eval_metric"      : "rmse",
        "device"           : "cuda",
    }
    model = XGBRegressor(**params)
    cv = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return -cv.mean()

# Run per target
best_params = {}
for target in ["Tg", "FFV", "Tc", "Density", "Rg"]:
    mask = ~np.isnan(df[target].values)
    study = optuna.create_study(direction="minimize")
    study.optimize(lambda t: objective(t, X_xgb[mask], df[target].values[mask]),
                   n_trials=100, show_progress_bar=True)
    best_params[target] = study.best_params
    print(f"{target}: best RMSE = {study.best_value:.4f}")

Final 5-fold OOF training with tuned parameters:

from sklearn.model_selection import KFold

targets = ["Tg", "FFV", "Tc", "Density", "Rg"]
oof_preds = np.zeros((len(X_xgb), len(targets)))
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for t_idx, target in enumerate(targets):
    y    = df[target].values
    mask = ~np.isnan(y)
    Xm, ym = X_xgb[mask], y[mask]

    for fold, (tr, val) in enumerate(kf.split(Xm)):
        model = XGBRegressor(
            **best_params[target],
            early_stopping_rounds=50,
            device="cuda",
        )
        model.fit(
            Xm[tr], ym[tr],
            eval_set=[(Xm[val], ym[val])],
            verbose=False,
        )
        oof_preds[np.where(mask)[0][val], t_idx] = model.predict(Xm[val])

| Target           | NN CV RMSE | XGBoost CV RMSE | Best XGB Params (key)            |
| ---------------- | ---------- | --------------- | -------------------------------- |
| Tg (°C)          | 18.4       | 11.2            | max_depth=7, lr=0.03, n_est=1400 |
| FFV              | 0.031      | 0.019           | max_depth=5, lr=0.01, n_est=1800 |
| Tc (°C)          | 22.1       | 13.8            | max_depth=8, lr=0.04, n_est=1100 |
| Density (g/cm³)  | 0.082      | 0.051           | max_depth=6, lr=0.02, n_est=900  |
| Rg (Å)           | 4.31       | 2.87            | max_depth=6, lr=0.05, n_est=1000 |
| Leaderboard Rank | ~1,200th   | 656th (private) |                                  |

8 — Feature Importance & Ensemble

Top Features (XGBoost SHAP — Tg model)

Using SHAP (SHapley Additive exPlanations) on the Tg model reveals which molecular descriptors drive glass transition temperature predictions the most:

  1. Chi1v — valence connectivity index (backbone rigidity proxy)
  2. BertzCT — molecular complexity / branching
  3. Kappa2 — shape kappa index (chain flexibility)
  4. fr_amide — amide fragment count (hydrogen-bonding groups)
  5. TPSA — topological polar surface area
  6. MolLogP — lipophilicity (hydrophobic backbone)
  7. fp_382, fp_719 — Morgan circular substructure bits
import shap, matplotlib.pyplot as plt

# Compute SHAP values for the Tg XGBoost model
explainer = shap.TreeExplainer(best_xgb_models["Tg"])
shap_values = explainer.shap_values(X_xgb[tg_mask])

# Summary plot
shap.summary_plot(
    shap_values,
    X_xgb[tg_mask],
    feature_names=feature_names,
    max_display=20,
    show=False,
)
plt.savefig("shap_tg.png", dpi=150, bbox_inches="tight")

Ensemble: Stacking NN + XGBoost OOF Predictions

The final submission stacked out-of-fold predictions from both models using a Ridge regression meta-learner, giving a small but consistent improvement over pure XGBoost on Tg and Rg targets.

from sklearn.linear_model import Ridge

# oof_nn_preds   shape: (N, 5)  — PolymerNet OOF outputs
# oof_xgb_preds  shape: (N, 5)  — XGBoost OOF outputs

stacked_preds = {}
for t_idx, target in enumerate(targets):
    mask = ~np.isnan(df[target].values)
    Xm = np.column_stack([
        oof_nn_preds[mask, t_idx],
        oof_xgb_preds[mask, t_idx],
    ])
    ym = df[target].values[mask]

    meta = Ridge(alpha=1.0)
    meta.fit(Xm, ym)
    print(f"{target} stacking weights → "
          f"NN: {meta.coef_[0]:.3f}  XGB: {meta.coef_[1]:.3f}")
    stacked_preds[target] = meta

# Example output:
# Tg   stacking weights → NN: 0.21  XGB: 0.79
# FFV  stacking weights → NN: 0.09  XGB: 0.91
# Rg   stacking weights → NN: 0.31  XGB: 0.69
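At inference time, the fitted meta-learners blend fresh base-model predictions with the same learned weights. A toy sketch for a single target (illustrative numbers, not project outputs):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy OOF predictions for one target (illustrative values)
oof_nn  = np.array([100.0, 150.0, 210.0, 180.0])
oof_xgb = np.array([110.0, 145.0, 205.0, 178.0])
y       = np.array([108.0, 146.0, 206.0, 179.0])

# Fit the Ridge meta-learner on stacked OOF columns
meta = Ridge(alpha=1.0)
meta.fit(np.column_stack([oof_nn, oof_xgb]), y)

# New polymer: blend the two base predictions with the learned weights
new_nn, new_xgb = np.array([130.0]), np.array([136.0])
blend = meta.predict(np.column_stack([new_nn, new_xgb]))
print(blend)
```

Remember to apply the saved target_scalers inverse-transform afterwards if the base models were trained on scaled labels.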

9 — Usage

Installation

pip install pandas numpy torch rdkit scikit-learn xgboost tensorboard

Training

# Train the neural network
cd train_v1
python train.py

Inference

# Run inference on new SMILES data
python train_v1/infer.py \
    --model runs/best_model.pt \
    --smiles "CC(C)(C)c1ccc(OC(=O)c2ccc(C(C)(C)C)cc2)cc1"
