---
title: "Using the covariate approach"
author: "Charles Thraves, Pablo Ubilla, Daniel Hermosilla"
date: "`r Sys.Date()`"

vignette: >
  %\VignetteIndexEntry{Using the covariate approach}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

output:
  rmarkdown::html_vignette:
    toc: false
    number_sections: false
---
In addition to the non-covariate approach presented in the main vignette (see `demonstration.Rmd`), `fastei` also supports a covariate approach that incorporates electoral-unit–level covariates. This approach follows the same general strategy of maximizing the log-likelihood via an EM algorithm, with the M-step carried out through a multinomial formulation.

The covariate specification requires computing a Hessian matrix, which can be computationally costly; therefore, an approximation is used in practice. In addition, parameter updates within each M-step are obtained via a Newton–Raphson procedure. Empirically, performing a single Newton step (i.e., `maxnewton = 1`) is typically sufficient to achieve convergence.

The main difference from the non-parametric approach is that the probability matrix is defined at the electoral-unit level rather than as a single global probability matrix. This enables the inclusion of electoral-unit covariates, which may better capture and explain variation in voting behavior across units. Consequently, convergence of the EM algorithm is assessed solely through increases in the log-likelihood.

Under this model, probabilities are computed using a multinomial softmax. Specifically, letting $p_{bgc}$ denote the probability associated with electoral unit $b$, group $g$, and candidate $c$, we define:

$$
p_{bgc}=
\frac{\exp\!\big(\beta_{gc} + (V\alpha^{T})_{bc}\big)}
{1+\sum_{j=1}^{C-1}\exp\!\big(\beta_{gj} + (V\alpha^{T})_{bj}\big)}.
$$

Thus, the initial probabilities depend on the parameters `alpha` and `beta`, which are set to zero by default if not provided.

## Simulating a parametric election

An election can be simulated using the function `simulate_election()`. When employing the covariate approach, it is necessary to specify both the number of covariates (`num_covariates`) and the number of districts (`num_districts`). In practice, a district may represent a collection of ballot boxes that share the same covariate values. In the example below, we simulate an election with three candidates, two demographic groups, and 20 covariates.

```{r}
library(fastei)

election <- simulate_election(
    num_ballots = 100,
    num_candidates = 3,
    num_groups = 2,
    num_covariates = 20,
    num_districts = 10,
    seed = 42
)
```

Note that the covariate setting must be explicitly enabled by specifying `num_covariates` and `num_districts` distinctly from zero.

The results of the simulation can be inspected by printing the resulting object. The simulated election is returned as an `eim` object.

```{r}
election
```

## Running the model

The model is estimated using the same function as in the non-covariate case. A minimal example is shown below:

```{r}
result <- run_em(election)
```
The quality of the estimated solution can be assessed by computing the mean absolute error (MAE):
```{r}
mae <- mean(abs(result$prob - result$real_prob))
mae
```

If a global probability matrix—analogous to that obtained in the non-parametric approach—is desired, it can be computed as a weighted average using the group sizes contained in `W`. Formally, the global probability for group ($g$) and candidate ($c$) is defined as

$$P_{gc} = \frac{\sum_b W_{bg}, p_{gcb}}{\sum_b W_{bg}}$$

The following code snippet implements this aggregation (note that we could also get the global matrix by using the `as.matrix()` function):
```{r}
global_prob <- function(W, prob) {
    G <- dim(prob)[1]
    C <- dim(prob)[2]
    P <- matrix(NA_real_, G, C)
    for (g in seq_len(G)) {
        w <- W[, g]
        P[g, ] <- colSums(t(prob[g, , ]) * w) / sum(w)
    }
    return(P)
}
```

Using this function, the estimated global probability matrix can be obtained as follows:
```{r}
estimated_global_probability <- global_prob(result$W, result$prob)
estimated_global_probability
```
The MAE for the global probabilities is then computed by comparing the estimated matrix with the true global probabilities:
```{r}
real_global_probability <- global_prob(election$W, election$real_prob)

parametric_global_mae <- mean(abs(estimated_global_probability - real_global_probability))
parametric_global_mae
```
Finally, the contribution of covariates to estimation accuracy can be evaluated by comparing the covariate model with the non-covariate approach. This is achieved by removing the covariate matrix V and re-estimating the model:

```{r}
non_parametric_election <- election
non_parametric_election$V <- NULL
non_parametric_result <- run_em(non_parametric_election)

non_parametric_mae <- mean(abs(non_parametric_result$prob - real_global_probability))

improvement <- (non_parametric_mae - parametric_global_mae) /
    non_parametric_mae * 100
improvement
```
The resulting value represents the percentage reduction in MAE achieved by incorporating covariates, relative to the non-parametric baseline.

## Reducing dimensionality with PCA

When working with a large number of covariates, it may be advantageous to reduce dimensionality using Principal Component Analysis (PCA). Dimensionality reduction can improve both computational efficiency and model interpretability. The function `PCA` allows the covariate matrix `V` to be transformed prior to running the EM algorithm.

Dimensionality can be reduced either by explicitly specifying the number of principal components to retain via the argument `components`, or by providing a threshold for the proportion of explained variance using `sd_threshold`. The example below illustrates how PCA can be applied to the covariate matrix `V` in the simulated election:

```{r}
PCA_election <- PCA(election, components = 2)
dim(PCA_election$V)
```
As expected, the resulting covariate matrix contains only two columns, corresponding to the two retained principal components. The EM algorithm can then be run on this modified election object:

```{r}
pca_result <- run_em(PCA_election)
pca_global_probability <- global_prob(pca_result$W, pca_result$prob)
pca_global_mae <- mean(abs(pca_global_probability - real_global_probability))
```
In some cases, reducing the number of covariates can lead to improved estimation accuracy. This can be evaluated by comparing the PCA-based model with the non-parametric baseline:
```{r}
pca_improvement <- (non_parametric_mae - pca_global_mae) /
    non_parametric_mae * 100
pca_improvement
```
The resulting value represents the percentage reduction in MAE achieved by applying PCA relative to the non-parametric approach.