Open Polymers is a machine learning project for predicting the physical and thermodynamic properties of polymers directly from their SMILES string representations. It combines classical molecular descriptor engineering (via RDKit) with both deep learning (PyTorch) and gradient-boosted trees (XGBoost), ultimately finishing 65th out of 2,400+ teams — jumping from rank ~1,200 after integrating XGBoost.
Each row in the dataset represents a unique polymer described by its SMILES string, with the following target properties:
| Column | Property | Unit / Notes |
|---|---|---|
| id | Unique polymer identifier | — |
| SMILES | Molecular structure string | RDKit-parseable representation |
| Tg | Glass transition temperature | °C — material processability |
| FFV | Fractional free volume | Dimensionless — gas permeability proxy |
| Tc | Decomposition temperature | °C — thermal stability |
| Density | Polymer density | g/cm³ |
| Rg | Radius of gyration | Å — chain conformation measure |
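To make the sparsity pattern concrete, here is a small sketch of inspecting per-target label coverage — the DataFrame contents here are purely illustrative stand-ins, not real competition data:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample of the training data (values illustrative only)
df = pd.DataFrame({
    "id": [0, 1, 2],
    "SMILES": ["CC", "c1ccccc1", "CCO"],
    "Tg": [105.0, np.nan, -20.0],
    "FFV": [0.37, 0.41, np.nan],
    "Tc": [np.nan, 410.0, 395.0],
    "Density": [1.05, np.nan, 0.95],
    "Rg": [np.nan, np.nan, 12.3],
})

# Fraction of rows where each target is actually observed
coverage = df[["Tg", "FFV", "Tc", "Density", "Rg"]].notna().mean()
print(coverage)
```

On the real dataset the same two lines reveal which targets are well-covered and which are scarce — the motivation for the per-target imputation described next.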
Polymer datasets are notoriously sparse — labs typically measure only a subset of properties for each compound. The data creation module therefore trains a separate imputation model per target: each of the five targets (FFV, Tg, Density, Tc, Rg) gets its own imputer, trained only on the rows where that target is observed.
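The project's actual imputation code is not shown here; the scheme can be sketched as follows, with the model choice (random forests) and function names being illustrative assumptions rather than the repository's implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_target_imputers(X, Y, target_names):
    """One imputer per target, trained only on rows where that target is observed."""
    imputers = {}
    for t, name in enumerate(target_names):
        mask = ~np.isnan(Y[:, t])
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X[mask], Y[mask, t])
        imputers[name] = model
    return imputers

def impute_targets(X, Y, imputers, target_names):
    """Fill each target's missing entries with its own model's predictions."""
    Y_filled = Y.copy()
    for t, name in enumerate(target_names):
        missing = np.isnan(Y[:, t])
        if missing.any():
            Y_filled[missing, t] = imputers[name].predict(X[missing])
    return Y_filled

# Toy demo: 2 targets with a deterministic missingness pattern
rng = np.random.default_rng(0)
X = rng.random((60, 8))
Y = rng.random((60, 2))
Y[::3, 0] = np.nan       # every 3rd row missing target 0
Y[1::4, 1] = np.nan      # some rows missing target 1
imputers = fit_target_imputers(X, Y, ["Tg", "FFV"])
Y_full = impute_targets(X, Y, imputers, ["Tg", "FFV"])
```

Observed values pass through untouched; only the gaps are filled by each target's dedicated model.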
Imputed datasets are saved under DATA/ for reproducibility and ablation.

Raw SMILES strings are converted into numerical feature vectors using RDKit: 208 physicochemical descriptors plus a 1,024-bit Morgan fingerprint (radius=2). After dropping NaN-heavy descriptors, the final feature set was 1,224-dimensional.
| Descriptor Family | Count | Examples |
|---|---|---|
| Constitutional | ~30 | MW, HeavyAtomCount, NumHeteroatoms, RingCount |
| Topological | ~40 | BertzCT, Chi0, Chi1, Kappa1-3, HallKierAlpha |
| Electronic / Charge | ~20 | MaxPartialCharge, MinAbsPartialCharge, TPSA |
| Lipophilicity | ~10 | MolLogP, MolMR, LabuteASA |
| Fragment counts | ~80 | fr_amide, fr_ArN, fr_ether, fr_C_O, … |
| VSA bins (MOE-style) | ~30 | PEOE_VSA1–14, SMR_VSA1–10, SlogP_VSA1–12 |
| Morgan Fingerprint | 1,024 | Circular substructure bits (radius=2) |
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
import pandas as pd, numpy as np

def smiles_to_descriptors(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {}
    # 208 physicochemical descriptors
    desc = {name: fn(mol) for name, fn in Descriptors.descList}
    # 1024-bit Morgan fingerprint (radius=2)
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    desc.update({f"fp_{i}": int(b) for i, b in enumerate(fp)})
    return desc

df_features = pd.DataFrame([smiles_to_descriptors(s) for s in df["SMILES"]])

# Drop descriptors with >50% NaN (usually invalid ring/charge descriptors)
df_features = df_features.loc[:, df_features.isna().mean() < 0.5]
df_features = df_features.fillna(df_features.median())
```
A PCA was run on the full descriptor space to visualise the chemical diversity of the dataset.
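A minimal sketch of such a projection — random stand-in data is used here instead of the real `df_features` matrix, and standardisation before PCA is an assumption on my part:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the (n_polymers, n_descriptors) feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))

# Standardise so high-variance descriptors don't dominate the components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
coords = pca.fit_transform(X_std)  # 2-D coordinates for a scatter plot
print(coords.shape, pca.explained_variance_ratio_)
```

Plotting `coords` coloured by any target (e.g. Tg) then shows whether chemically similar polymers cluster together.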
Before feeding descriptors into either model, a careful preprocessing pipeline is applied. The neural network is especially sensitive to feature scale, while XGBoost is invariant to monotonic transforms but benefits from variance-based feature selection.
```python
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Step 1 – remove near-constant features (variance < 0.01)
selector = VarianceThreshold(threshold=0.01)
X_sel = selector.fit_transform(df_features.values)

# Step 2 – robust scaling (median/IQR — insensitive to outliers),
# applied only for the neural network path
scaler = RobustScaler()
X_nn = scaler.fit_transform(X_sel)   # ← used by PolymerNet
X_xgb = X_sel                        # ← raw for XGBoost

# Step 3 – per-target label scaling (avoids the NN chasing large-scale targets)
target_scalers = {}
for col in ["Tg", "FFV", "Tc", "Density", "Rg"]:
    ts = StandardScaler()
    mask = ~df[col].isna()
    df.loc[mask, col + "_scaled"] = ts.fit_transform(df.loc[mask, [col]]).ravel()
    target_scalers[col] = ts  # saved for inverse-transform at inference

print(f"Feature dims after selection: {X_sel.shape[1]} / {df_features.shape[1]}")
# >> Feature dims after selection: 987 / 1224
```
A fully-connected feed-forward network that takes the RDKit descriptor vector as input and predicts all five targets simultaneously (multi-output regression). Batch normalisation and dropout are used to prevent overfitting on the small dataset.
```python
import torch
import torch.nn as nn

class PolymerNet(nn.Module):
    def __init__(self, input_dim: int, output_dim: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim),
        )

    def forward(self, x):
        return self.net(x)

# Training setup
model = PolymerNet(input_dim=X_train.shape[1]).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
criterion = nn.MSELoss()

# TensorBoard logging
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir="runs/polymer_nn_v1")

for epoch in range(300):
    model.train()
    pred = model(X_train_t)
    loss = criterion(pred, y_train_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss)
    writer.add_scalar("Loss/train", loss.item(), epoch)
```
Monitor training with TensorBoard:
```bash
tensorboard --logdir runs
```
After the neural network plateaued, XGBoost regressors were trained per-target using the same RDKit descriptor features (987-dim after variance filtering). This single switch drove the biggest leaderboard jump of the project.
Each target's XGBoost was tuned independently using Optuna with 100 trials of TPE (Tree-structured Parzen Estimator) search, optimising 5-fold CV RMSE on the observed rows for that target.
```python
import optuna
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

def objective(trial, X, y):
    # Trial names match XGBoost parameter names exactly,
    # so study.best_params can be passed straight to XGBRegressor
    params = {
        "n_estimators":      trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate":     trial.suggest_float("learning_rate", 0.005, 0.3, log=True),
        "max_depth":         trial.suggest_int("max_depth", 3, 10),
        "subsample":         trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree":  trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.4, 1.0),
        "reg_alpha":         trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),
        "reg_lambda":        trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),
        "min_child_weight":  trial.suggest_int("min_child_weight", 1, 20),
        "eval_metric":       "rmse",
        "device":            "cuda",
    }
    model = XGBRegressor(**params)
    cv = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    return -cv.mean()  # mean RMSE (minimised)

# Run per target
best_params = {}
for target in ["Tg", "FFV", "Tc", "Density", "Rg"]:
    mask = ~np.isnan(df[target].values)
    study = optuna.create_study(direction="minimize")
    study.optimize(lambda t: objective(t, X_xgb[mask], df[target].values[mask]),
                   n_trials=100, show_progress_bar=True)
    best_params[target] = study.best_params
    print(f"{target}: best RMSE = {study.best_value:.4f}")
```
Final 5-fold OOF training with tuned parameters:
```python
from sklearn.model_selection import KFold

targets = ["Tg", "FFV", "Tc", "Density", "Rg"]
oof_preds = np.zeros((len(X_xgb), len(targets)))
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for t_idx, target in enumerate(targets):
    y = df[target].values
    mask = ~np.isnan(y)
    Xm, ym = X_xgb[mask], y[mask]
    for fold, (tr, val) in enumerate(kf.split(Xm)):
        model = XGBRegressor(
            **best_params[target],
            early_stopping_rounds=50,
            device="cuda",
        )
        model.fit(
            Xm[tr], ym[tr],
            eval_set=[(Xm[val], ym[val])],
            verbose=False,
        )
        oof_preds[np.where(mask)[0][val], t_idx] = model.predict(Xm[val])
```
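The per-target CV RMSE figures come from scoring out-of-fold predictions against the observed labels. A self-contained sketch on synthetic stand-in data (the real evaluation uses `oof_preds` and the masked columns of `df`):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for one target's observed labels and OOF predictions
rng = np.random.default_rng(0)
y_true = rng.normal(loc=100.0, scale=30.0, size=400)  # e.g. Tg in °C
y_oof = y_true + rng.normal(scale=11.0, size=400)     # predictions with ~11 °C error

rmse = np.sqrt(mean_squared_error(y_true, y_oof))
print(f"OOF RMSE: {rmse:.2f}")
```

Because the score is computed only on rows where a target is observed, sparse targets are evaluated on fewer samples and their CV estimates are correspondingly noisier.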
| Target | NN CV RMSE | XGBoost CV RMSE | Best XGB Params (key) |
|---|---|---|---|
| Tg (°C) | 18.4 | 11.2 | max_depth=7, lr=0.03, n_est=1400 |
| FFV | 0.031 | 0.019 | max_depth=5, lr=0.01, n_est=1800 |
| Tc (°C) | 22.1 | 13.8 | max_depth=8, lr=0.04, n_est=1100 |
| Density (g/cm³) | 0.082 | 0.051 | max_depth=6, lr=0.02, n_est=900 |
| Rg (Å) | 4.31 | 2.87 | max_depth=6, lr=0.05, n_est=1000 |
| Leaderboard Rank | ~1,200th | 65th (private) | — |
Using SHAP (SHapley Additive exPlanations) on the Tg model reveals which molecular descriptors drive glass transition temperature predictions the most:
- Chi1v — valence connectivity index (backbone rigidity proxy)
- BertzCT — molecular complexity / branching
- Kappa2 — shape kappa index (chain flexibility)
- fr_amide — amide fragment count (hydrogen-bonding groups)
- TPSA — topological polar surface area
- MolLogP — lipophilicity (hydrophobic backbone)
- fp_382, fp_719 — Morgan circular substructure bits

```python
import shap
import matplotlib.pyplot as plt

# Compute SHAP values for the Tg XGBoost model
explainer = shap.TreeExplainer(best_xgb_models["Tg"])
shap_values = explainer.shap_values(X_xgb[tg_mask])

# Summary plot
shap.summary_plot(
    shap_values,
    X_xgb[tg_mask],
    feature_names=feature_names,
    max_display=20,
    show=False,
)
plt.savefig("shap_tg.png", dpi=150, bbox_inches="tight")
```
The final submission stacked out-of-fold predictions from both models using a Ridge regression meta-learner, giving a small but consistent improvement over pure XGBoost on Tg and Rg targets.
```python
from sklearn.linear_model import Ridge

# oof_nn_preds shape:  (N, 5) — PolymerNet OOF outputs
# oof_xgb_preds shape: (N, 5) — XGBoost OOF outputs
stacked_preds = {}
for t_idx, target in enumerate(targets):
    mask = ~np.isnan(df[target].values)
    Xm = np.column_stack([
        oof_nn_preds[mask, t_idx],
        oof_xgb_preds[mask, t_idx],
    ])
    ym = df[target].values[mask]
    meta = Ridge(alpha=1.0)
    meta.fit(Xm, ym)
    print(f"{target} stacking weights → "
          f"NN: {meta.coef_[0]:.3f}  XGB: {meta.coef_[1]:.3f}")
    stacked_preds[target] = meta

# Example output:
# Tg  stacking weights → NN: 0.21  XGB: 0.79
# FFV stacking weights → NN: 0.09  XGB: 0.91
# Rg  stacking weights → NN: 0.31  XGB: 0.69
```
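As a toy illustration of why the Ridge meta-learner assigns the larger weight to the stronger base model — fully synthetic data, not the project's OOF predictions:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Simulate two base models: one noisier (NN-like), one more accurate (XGB-like)
rng = np.random.default_rng(0)
y = rng.normal(size=200)
nn_oof = y + rng.normal(scale=0.5, size=200)   # weaker base model
xgb_oof = y + rng.normal(scale=0.2, size=200)  # stronger base model

meta = Ridge(alpha=1.0).fit(np.column_stack([nn_oof, xgb_oof]), y)
print(f"NN weight: {meta.coef_[0]:.3f}  XGB weight: {meta.coef_[1]:.3f}")
```

The meta-learner automatically down-weights the noisier predictor, which mirrors the NN-vs-XGB weight split seen in the example output above.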
```bash
# Install dependencies
pip install pandas numpy torch rdkit-pypi scikit-learn xgboost tensorboard

# Train the neural network
cd train_v1
python train.py

# Run inference on new SMILES data
python train_v1/infer.py \
    --model runs/best_model.pt \
    --smiles "CC(C)(C)c1ccc(OC(=O)c2ccc(C(C)(C)C)cc2)cc1"

# Monitor training
tensorboard --logdir runs
```