Open Polymers – Predicting Polymer Properties with ML

Posted on June 20, 2025  |  Machine Learning  |  Cheminformatics

PyTorch XGBoost RDKit SMILES Cheminformatics TensorBoard

Open Polymers is a machine learning project for predicting the physical and thermodynamic properties of polymers directly from their SMILES string representations. It combines classical molecular descriptor engineering (via RDKit) with both deep learning (PyTorch) and gradient-boosted trees (XGBoost), ultimately finishing 656th out of 2,400+ teams after an XGBoost-driven jump from roughly 1,200th.

🏆 Rank 656 / 2,400+
Final Leaderboard Position
XGBoost jump: ~1,200th → 656th on the private leaderboard

1 — Project Structure

open-polymers/
  ├── data.ipynb    # EDA & visualization notebook
  ├── train_v1/
  │   ├── train.py    # Neural network training script
  │   └── infer.py    # Inference on new SMILES inputs
  ├── Data creation/
  │   └── # Imputation models for FFV, Tg, Density, Tc, Rg
  ├── DATA/
  │   └── # All versioned datasets (filled with different algos)
  ├── pca_descriptors.png    # PCA of molecular descriptors
  └── runs/       # TensorBoard logs per training run

2 — Dataset

5 — Target properties
SMILES — Primary input representation
RDKit — Descriptor engine
XGBoost — Best-performing model

Each row in the dataset represents a unique polymer described by its SMILES string, with the following target properties:

| Column  | Property                     | Unit / Notes                          |
| ------- | ---------------------------- | ------------------------------------- |
| id      | Unique polymer identifier    |                                       |
| SMILES  | Molecular structure string   | RDKit-parseable representation        |
| Tg      | Glass transition temperature | °C; material processability           |
| FFV     | Fractional free volume       | dimensionless; gas permeability proxy |
| Tc      | Decomposition temperature    | °C; thermal stability                 |
| Density | Polymer density              | g/cm³                                 |
| Rg      | Radius of gyration           | Å; chain conformation measure         |
📂 The DATA/ folder contains multiple dataset versions, each filled with a different missing-value imputation algorithm (KNN, regression, iterative, etc.) to handle the sparse property measurements common in polymer datasets.
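A quick way to see why multiple imputed versions exist is to profile how sparse each target actually is. A minimal sketch on a toy frame (the column names are the dataset's; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sparse label structure (illustrative values only)
df = pd.DataFrame({
    "Tg":      [150.0, np.nan, 210.0, np.nan],
    "FFV":     [0.36, 0.41, np.nan, np.nan],
    "Tc":      [np.nan, 320.0, np.nan, 290.0],
    "Density": [1.10, np.nan, 1.25, 1.05],
    "Rg":      [np.nan, np.nan, 18.2, 21.7],
})

# Fraction of missing labels per target -- the sparser a column,
# the more the downstream model depends on the imputation quality
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)
```

Profiling like this guides which imputation algorithm to trust per target before committing to a dataset version.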

3 — Handling Missing Values

Polymer datasets are notoriously sparse — labs typically measure only a subset of properties for any given compound. The Data creation module trains separate imputation models for each of the five targets:

Imputation Strategy
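The repository's per-target imputation models aren't reproduced here, but one of the strategies mentioned earlier (iterative, regression-based imputation) can be sketched with scikit-learn. The numbers below are illustrative, not real measurements:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix of the five targets with missing entries (illustrative values)
X = np.array([
    [150.0, 0.36, np.nan, 1.10, np.nan],
    [np.nan, 0.41, 320.0, np.nan, 19.5],
    [210.0, np.nan, np.nan, 1.25, 18.2],
    [180.0, 0.38, 290.0, 1.05, 21.7],
])

# Each missing value is modelled as a regression on the observed columns,
# iterating until the fills stabilise
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
assert not np.isnan(X_filled).any()
```

Fitting one such imputer per target (or per dataset version, as in DATA/) lets each model train on a densified label matrix.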

4 — Feature Engineering (SMILES → Descriptors)

Raw SMILES strings are converted into numerical feature vectors using RDKit: 208 physicochemical descriptors plus a 1,024-bit Morgan fingerprint (radius=2), i.e. 1,232 raw features, trimmed to 1,224 after dropping NaN-heavy descriptors.

| Descriptor Family    | Count | Examples                                     |
| -------------------- | ----- | -------------------------------------------- |
| Constitutional       | ~30   | MW, HeavyAtomCount, NumHeteroatoms, RingCount |
| Topological          | ~40   | BertzCT, Chi0, Chi1, Kappa1-3, HallKierAlpha |
| Electronic / Charge  | ~20   | MaxPartialCharge, MinAbsPartialCharge, TPSA  |
| Lipophilicity        | ~10   | MolLogP, MolMR, LabuteASA                    |
| Fragment counts      | ~80   | fr_amide, fr_ArN, fr_ether, fr_C_O, …        |
| VSA bins (MOE-style) | ~30   | PEOE_VSA1–14, SMR_VSA1–10, SlogP_VSA1–12     |
| Morgan Fingerprint   | 1,024 | Circular substructure bits (radius=2)        |

from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
import pandas as pd, numpy as np

def smiles_to_descriptors(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {}
    # 208 physicochemical descriptors
    desc = {name: fn(mol) for name, fn in Descriptors.descList}
    # 1024-bit Morgan fingerprint (radius=2)
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    desc.update({f"fp_{i}": int(b) for i, b in enumerate(fp)})
    return desc

df_features = pd.DataFrame([smiles_to_descriptors(s) for s in df["SMILES"]])
# Drop descriptors with >50% NaN (usually invalid ring/charge descriptors)
df_features = df_features.loc[:, df_features.isna().mean() < 0.5]
df_features = df_features.fillna(df_features.median())

A PCA was run on the full descriptor space to visualise the chemical diversity of the dataset:

📊 pca_descriptors.png — clusters in 2D PCA space reveal structurally distinct polymer families (e.g. polycarbonates, polyimides, siloxanes), motivating ensemble models over a single neural network.
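As a sketch of how such a projection is produced (a random matrix stands in here for the real descriptor frame, so it runs standalone):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for the descriptor matrix

# Descriptors live on wildly different scales, so standardise before PCA
X_std = StandardScaler().fit_transform(X)
coords = PCA(n_components=2).fit_transform(X_std)
print(coords.shape)              # one 2-D point per polymer: (200, 2)
```

Colouring `coords` by a known family label (or by a clustering of the fingerprints) is what surfaces the structurally distinct groups in pca_descriptors.png.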

5 — Preprocessing & Scaling

Before feeding descriptors into either model, a careful preprocessing pipeline is applied. The neural network is especially sensitive to feature scale, while XGBoost is invariant to monotonic transforms but benefits from variance-based feature selection.

from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Step 1 – drop (near-)constant features (variance below 0.01)
selector = VarianceThreshold(threshold=0.01)
X_sel = selector.fit_transform(df_features.values)

# Step 2 – robust scaling (median/IQR — insensitive to outliers)
#           applied only for the neural network path
scaler = RobustScaler()
X_nn  = scaler.fit_transform(X_sel)   # ← used by PolymerNet
X_xgb = X_sel                          # ← raw for XGBoost

# Step 3 – per-target label scaling (avoids the NN chasing large-scale targets)
from sklearn.preprocessing import StandardScaler
target_scalers = {}
for col in ["Tg", "FFV", "Tc", "Density", "Rg"]:
    ts = StandardScaler()
    mask = ~df[col].isna()
    df.loc[mask, col + "_scaled"] = ts.fit_transform(
        df.loc[mask, [col]]
    ).ravel()
    target_scalers[col] = ts   # saved for inverse-transform at inference

print(f"Feature dims after selection: {X_sel.shape[1]} / {df_features.shape[1]}")
# >> Feature dims after selection: 987 / 1224

6 — Neural Network Model (PyTorch)

Architecture

A fully-connected feed-forward network that takes the RDKit descriptor vector as input and predicts all five targets simultaneously (multi-output regression). Batch normalisation and dropout are used to prevent overfitting on the small dataset.

import torch
import torch.nn as nn

class PolymerNet(nn.Module):
    def __init__(self, input_dim: int, output_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(256, 128),
            nn.ReLU(),

            nn.Linear(128, output_dim),
        )

    def forward(self, x):
        return self.net(x)


# Training setup
model = PolymerNet(input_dim=X_train.shape[1]).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
criterion = nn.MSELoss()

# TensorBoard logging
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir="runs/polymer_nn_v1")

for epoch in range(300):  # full-batch pass over the scaled training tensors
    model.train()
    pred = model(X_train_t)
    loss = criterion(pred, y_train_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss)
    writer.add_scalar("Loss/train", loss.item(), epoch)
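One caveat worth flagging for the loop above: with sparse labels, a plain nn.MSELoss would propagate NaNs from unmeasured targets into the gradients. A masked variant (a sketch, not necessarily the repository's exact loss) averages the error over observed entries only:

```python
import torch

def masked_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """MSE computed over observed (non-NaN) targets only."""
    mask = ~torch.isnan(target)
    diff = pred[mask] - target[mask]   # select observed cells, skip NaNs
    return (diff ** 2).mean()

pred   = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.5, float("nan")], [3.0, 5.0]])
print(masked_mse(pred, target).item())  # mean of (0.25, 0.0, 1.0) ≈ 0.4167
```

Swapping `criterion` for a loss of this shape lets the multi-output network train on every row, however incomplete its labels.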

Monitor training with TensorBoard:

tensorboard --logdir runs

7 — XGBoost (Best-Performing Model)

After the neural network plateaued, XGBoost regressors were trained per-target using the same RDKit descriptor features (987-dim after variance filtering). This single switch drove the biggest leaderboard jump of the project.

Hyperparameter Tuning with Optuna

Each target's XGBoost was tuned independently using Optuna with 100 trials of TPE (Tree-structured Parzen Estimator) search, optimising 5-fold CV RMSE on the observed rows for that target.

import optuna
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

def objective(trial, X, y):
    # Note: Optuna stores parameters under the names passed to suggest_*,
    # and study.best_params is later splatted into XGBRegressor — so the
    # suggest names must match XGBoost's parameter names exactly.
    params = {
        "n_estimators"     : trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate"    : trial.suggest_float("learning_rate", 0.005, 0.3, log=True),
        "max_depth"        : trial.suggest_int("max_depth", 3, 10),
        "subsample"        : trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree" : trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.4, 1.0),
        "reg_alpha"        : trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),
        "reg_lambda"       : trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),
        "min_child_weight" : trial.suggest_int("min_child_weight", 1, 20),
        "eval_metric"      : "rmse",
        "device"           : "cuda",
    }
    model = XGBRegressor(**params)
    cv = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return -cv.mean()

# Run per target
best_params = {}
for target in ["Tg", "FFV", "Tc", "Density", "Rg"]:
    mask = ~np.isnan(df[target].values)
    study = optuna.create_study(direction="minimize")
    study.optimize(lambda t: objective(t, X_xgb[mask], df[target].values[mask]),
                   n_trials=100, show_progress_bar=True)
    best_params[target] = study.best_params
    print(f"{target}: best RMSE = {study.best_value:.4f}")

Final 5-fold OOF training with tuned parameters:

from sklearn.model_selection import KFold

targets = ["Tg", "FFV", "Tc", "Density", "Rg"]
oof_preds = np.zeros((len(X_xgb), len(targets)))
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for t_idx, target in enumerate(targets):
    y    = df[target].values
    mask = ~np.isnan(y)
    Xm, ym = X_xgb[mask], y[mask]

    for fold, (tr, val) in enumerate(kf.split(Xm)):
        model = XGBRegressor(
            **best_params[target],
            early_stopping_rounds=50,
            device="cuda",
        )
        model.fit(
            Xm[tr], ym[tr],
            eval_set=[(Xm[val], ym[val])],
            verbose=False,
        )
        oof_preds[np.where(mask)[0][val], t_idx] = model.predict(Xm[val])

| Target           | NN CV RMSE | XGBoost CV RMSE | Best XGB Params (key)            |
| ---------------- | ---------- | --------------- | -------------------------------- |
| Tg (°C)          | 18.4       | 11.2            | max_depth=7, lr=0.03, n_est=1400 |
| FFV              | 0.031      | 0.019           | max_depth=5, lr=0.01, n_est=1800 |
| Tc (°C)          | 22.1       | 13.8            | max_depth=8, lr=0.04, n_est=1100 |
| Density (g/cm³)  | 0.082      | 0.051           | max_depth=6, lr=0.02, n_est=900  |
| Rg (Å)           | 4.31       | 2.87            | max_depth=6, lr=0.05, n_est=1000 |
| Leaderboard Rank | ~1,200th   | 656th (private) |                                  |

8 — Feature Importance & Ensemble

Top Features (XGBoost SHAP — Tg model)

Using SHAP (SHapley Additive exPlanations) on the Tg model reveals which molecular descriptors drive glass transition temperature predictions the most:

  1. Chi1v — valence connectivity index (backbone rigidity proxy)
  2. BertzCT — molecular complexity / branching
  3. Kappa2 — shape kappa index (chain flexibility)
  4. fr_amide — amide fragment count (hydrogen-bonding groups)
  5. TPSA — topological polar surface area
  6. MolLogP — lipophilicity (hydrophobic backbone)
  7. fp_382, fp_719 — Morgan circular substructure bits
import shap, matplotlib.pyplot as plt

# Compute SHAP values for the Tg XGBoost model
explainer = shap.TreeExplainer(best_xgb_models["Tg"])
shap_values = explainer.shap_values(X_xgb[tg_mask])

# Summary plot
shap.summary_plot(
    shap_values,
    X_xgb[tg_mask],
    feature_names=feature_names,
    max_display=20,
    show=False,
)
plt.savefig("shap_tg.png", dpi=150, bbox_inches="tight")

Ensemble: Stacking NN + XGBoost OOF Predictions

The final submission stacked out-of-fold predictions from both models using a Ridge regression meta-learner, giving a small but consistent improvement over pure XGBoost on Tg and Rg targets.

from sklearn.linear_model import Ridge

# oof_nn_preds   shape: (N, 5)  — PolymerNet OOF outputs
# oof_xgb_preds  shape: (N, 5)  — XGBoost OOF outputs

stacked_preds = {}
for t_idx, target in enumerate(targets):
    mask = ~np.isnan(df[target].values)
    Xm = np.column_stack([
        oof_nn_preds[mask, t_idx],
        oof_xgb_preds[mask, t_idx],
    ])
    ym = df[target].values[mask]

    meta = Ridge(alpha=1.0)
    meta.fit(Xm, ym)
    print(f"{target} stacking weights → "
          f"NN: {meta.coef_[0]:.3f}  XGB: {meta.coef_[1]:.3f}")
    stacked_preds[target] = meta

# Example output:
# Tg   stacking weights → NN: 0.21  XGB: 0.79
# FFV  stacking weights → NN: 0.09  XGB: 0.91
# Rg   stacking weights → NN: 0.31  XGB: 0.69
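At inference time, the fitted meta-learners blend fresh base-model predictions with the same learned weights. A toy sketch for a single target (illustrative numbers, not project outputs):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy OOF predictions for one target (illustrative values)
oof_nn  = np.array([100.0, 150.0, 210.0, 180.0])
oof_xgb = np.array([110.0, 145.0, 205.0, 178.0])
y       = np.array([108.0, 146.0, 206.0, 179.0])

# Fit the Ridge meta-learner on stacked OOF columns
meta = Ridge(alpha=1.0)
meta.fit(np.column_stack([oof_nn, oof_xgb]), y)

# New polymer: blend the two base predictions with the learned weights
new_nn, new_xgb = np.array([130.0]), np.array([136.0])
blend = meta.predict(np.column_stack([new_nn, new_xgb]))
print(blend)
```

Remember to apply the saved target_scalers inverse-transform afterwards if the base models were trained on scaled labels.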

9 — Usage

Installation

pip install pandas numpy torch rdkit scikit-learn xgboost tensorboard

Training

# Train the neural network
cd train_v1
python train.py

Inference

# Run inference on new SMILES data
python train_v1/infer.py \
    --model runs/best_model.pt \
    --smiles "CC(C)(C)c1ccc(OC(=O)c2ccc(C(C)(C)C)cc2)cc1"
