Python Scikit-Learn k-NN Imputation Matrix Completion Pandas

Objective: Develop a robust algorithm to reconstruct missing values in 1,000 electricity load curves (Linky-like data). The dataset contained approximately 69,000 synthetic curves generated by DeepCourbogen. The challenge was to restore coherent temporal dynamics; submissions were scored by Mean Absolute Error (MAE), computed only on the missing points.
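To make the metric concrete, here is a minimal sketch of an MAE restricted to the missing points. The function name masked_mae, the boolean mask convention (True = point was missing) and the toy arrays are assumptions for illustration, not the challenge's official scoring code.

```python
import numpy as np

def masked_mae(y_true, y_pred, missing_mask):
    """Mean absolute error computed only where missing_mask is True."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)
    if not missing_mask.any():
        raise ValueError("missing_mask selects no points")
    abs_err = np.abs(y_true[missing_mask] - y_pred[missing_mask])
    return float(abs_err.mean())

# Toy example (hypothetical values): two curves, two hidden points in total.
truth = np.array([[10.0, 12.0, 11.0], [20.0, 18.0, 19.0]])
pred = np.array([[10.0, 15.0, 11.0], [21.0, 18.0, 19.0]])
mask = np.array([[False, True, False], [True, False, False]])
score = masked_mae(truth, pred, mask)  # averages |12 - 15| and |20 - 21|
```

Points that were observed never enter the average, so a method gains nothing by reproducing known values.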

The Time Series Challenge

Electricity consumption is highly periodic. A standard integer encoding of the hour creates a discontinuity at midnight: 23:30 and 00:00 fall in hours 23 and 0 (distance = 23). In physical reality, these moments are adjacent.

Naive Approach

23h → 0h
Distance = 23 (Huge Gap)

Our Approach

Cyclic Projection
Distance ≈ 0.26 (Continuous)

# Cyclic Time Encoding (h = hour of day, 0 ≤ h < 24)

h_sin = sin(2 * π * h / 24)

h_cos = cos(2 * π * h / 24)
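The encoding above can be checked numerically. The sketch below projects hours onto the unit circle and measures the Euclidean distance between two encoded hours; the helper names encode_hour and cyclic_distance are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def encode_hour(h, period=24.0):
    """Map an hour (scalar or array) to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(h, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def cyclic_distance(h1, h2, period=24.0):
    """Euclidean distance between two hours after cyclic encoding."""
    return float(np.linalg.norm(encode_hour(h1, period) - encode_hour(h2, period)))

d_midnight = cyclic_distance(23, 0)   # about 0.26, versus 23 with integer encoding
d_noon = cyclic_distance(0, 12)       # 2.0, the largest possible distance
```

With this projection, any two hours that are one hour apart are equally close, wherever they sit in the day, and the largest distance is reached for hours 12 apart.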

1. Weighted k-NN

Simple interpolation fails on complex consumption patterns. We implemented a custom k-NN regressor that, for each incomplete curve, selects its 5 nearest neighbors using a distance computed only over that curve's valid (non-missing) points.

  • Calculates vertical bias (offset) to adjust neighbor curves.
  • Applies inverse distance weighting for final prediction.
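The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: it assumes a reference library of complete curves, uses root-mean-square distance over the target's observed points, and defines the offset as the mean gap between target and neighbor on those points. The name knn_impute_curve and the eps parameter are hypothetical.

```python
import numpy as np

def knn_impute_curve(target, library, k=5, eps=1e-8):
    """Fill NaNs in `target` from its k nearest curves in `library`.

    target  : 1-D array with NaN at missing time steps
    library : 2-D array (n_curves, n_steps) of complete curves (assumption)
    """
    target = np.asarray(target, dtype=float)
    library = np.asarray(library, dtype=float)
    observed = ~np.isnan(target)
    if not observed.any():
        raise ValueError("target has no observed points")

    # Distance uses only the time steps that are observed in the target.
    diffs = library[:, observed] - target[observed]
    dists = np.sqrt(np.mean(diffs ** 2, axis=1))

    k = min(k, len(library))
    nearest = np.argsort(dists)[:k]

    # Vertical offset: shift each neighbor so its mean level on the
    # observed points matches the target's.
    offsets = -diffs[nearest].mean(axis=1)
    shifted = library[nearest] + offsets[:, None]

    # Inverse distance weighting, normalized to sum to 1.
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    estimate = weights @ shifted

    filled = target.copy()
    filled[~observed] = estimate[~observed]
    return filled

# Toy check: neighbors share the target's shape but sit at other levels.
t = np.linspace(0.0, 2.0 * np.pi, 48, endpoint=False)
library = np.stack([np.sin(t) + c for c in (0.0, 1.0, 2.0, 5.0, 7.0, 9.0)])
truth = np.sin(t) + 3.0
target = truth.copy()
target[10:15] = np.nan
filled = knn_impute_curve(target, library, k=5)
```

In the toy check, the offset step is what lets curves at a different consumption level still reconstruct the gap, since only their shape is borrowed.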

2. Matrix Completion

We assumed the consumption matrix has a low-rank structure (users share common behaviors). We used the SoftImpute algorithm.

  • Iterative SVD with singular value thresholding.
  • Captures global trends across all 69,000 curves simultaneously.
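As an illustration of iterative SVD with singular value thresholding, here is a minimal NumPy sketch in the spirit of SoftImpute. It is not the code used in the project: the function name soft_impute, the threshold lam, the stopping rule and the synthetic rank-1 data are all assumptions, and a dense SVD like this one would be slow on a matrix with 69,000 rows.

```python
import numpy as np

def soft_impute(X, lam=1.0, max_iter=200, tol=1e-5):
    """Complete a matrix with NaNs by iterated soft-thresholded SVD."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.where(observed, X, 0.0)  # initial guess: zeros in the gaps
    for _ in range(max_iter):
        # Keep observed entries, use the current estimate in the gaps.
        filled = np.where(observed, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values (shrink by lam, floor at 0).
        s_thr = np.maximum(s - lam, 0.0)
        Z_new = (U * s_thr) @ Vt
        change = np.linalg.norm(Z_new - Z) / (np.linalg.norm(Z) + 1e-12)
        Z = Z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; only gaps are imputed.
    return np.where(observed, X, Z)

# Synthetic rank-1 example (hypothetical data, about 20% of entries hidden).
rng = np.random.default_rng(0)
truth = np.outer(rng.uniform(1, 2, 30), rng.uniform(1, 2, 12))
mask = rng.random(truth.shape) < 0.2
X = truth.copy()
X[mask] = np.nan
completed = soft_impute(X, lam=1.0)
err = np.abs(completed[mask] - truth[mask]).mean()
baseline = np.abs(truth[mask]).mean()  # error of leaving the gaps at zero
```

The threshold lam controls the effective rank: larger values zero out more singular values and give a smoother, lower-rank reconstruction.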

Performance & Impact

81 kWh Final MAE Score
31st Rank Nationwide
1000 Curves Reconstructed