Implementing Monitoring on a Self-Hosted Site

Architecture
Endpoint Setup
Internal Monitoring
External Monitoring
Final Thoughts

I am hosting theobjectivedad.com from my homelab out of my residential Internet service. Although service disruptions of a self-hosted personal website won’t cost me my career I still want to make theobjectivedad.com as available as possible. This means adding an entire universe of scope that, for the most part, begins with implementing a health check and will likely end with multi-site hot-hot availability.

In this article, I will explore how I implemented health checks and monitoring on theobjectivedad.com. I cover the overall monitoring architecture, how I developed the readiness and liveliness endpoints, and how I configured internal and external monitoring.

Architecture

As shown in the diagram below, each of the 3 Nginx pod replicas hosts a copy of the website and provides two health-related endpoints. The first, https://www.theobjectivedad.com/sys/alive responds with an HTTP 200 when Nginx is running and can respond to requests. The second, https://www.theobjectivedad.com/sys/ready, responds with an HTTP 200 when the pod can serve the actual site.

flowchart LR
subgraph Cloudflare
direction TB
TUNNEL(fab:fa-cloudflare Tunnel)
end
subgraph K8sLab
direction TB

subgraph Deployment
direction LR

NGINX1("&lt;Pod>\nfas:fa-server Nginx")
NGINX2("&lt;Pod>\nfas:fa-server Nginx")
NGINX3("&lt;Pod>\nfas:fa-server Nginx")

READY1("fas:fa-file /sys/ready")
ALIVE1("fas:fa-file /sys/alive")

READY2("fas:fa-file /sys/ready")
ALIVE2("fas:fa-file /sys/alive")

READY3("fas:fa-file /sys/ready")
ALIVE3("fas:fa-file /sys/alive")

NGINX1 -.- ALIVE1
NGINX1 -.- READY1

NGINX2 -.- ALIVE2
NGINX2 -.- READY2

NGINX3 -.- ALIVE3
NGINX3 -.- READY3

end

CONNECTOR("&lt;Deployment>\nCloudflare Tunnel")
INGRESS("&lt;Ingress>\nIngress Controller")
SERVICE("&lt;Service>\nService")


TUNNEL -- "HTTP2" --> CONNECTOR
CONNECTOR -- "HTTP2" --> INGRESS
INGRESS -- "HTTP2" --> SERVICE
SERVICE -- "HTTP" --> Deployment

KUBLET("fas:fa-gear Kublet")
KUBLET ---> NGINX1
KUBLET ---> NGINX2
KUBLET ---> NGINX3

end

ROBOT("&lt;SaaS>\nfas:fa-robot Uptime Robot")
ROBOT -- "HTTPS" --> TUNNEL

Whenever a pod replica starts in the Nginx Kubernetes deployment, a sidecar process synchronizes the published content from a git repository to an emptyDir in the pod. It is important to differentiate between the pod being dead (needs a restart) vs not being ready to serve content (the sync process is running). If I only implemented a single health endpoint for both liveliness and readiness checks I would run the risk of infinite restarts if the synchronization process ran longer than the liveliness threshold.

Internally, the Kubernetes Kublet is configured to periodically check both /sys/alive and /sys/ready for each replica. After a given number of failures on /sys/alive the Kublet will attempt to restart the pod.

Externally, I am using UptimeRobot, a freemium service that periodically hits my /sys/ready endpoint to verify the site is still running.

Endpoint Setup

The entire health check project started by implementing the actual endpoints /sys/alive and /sys/ready endpoints.

The process that renders the actual site content uses Pelican which will build a static copy of the entire site based on my pre-configured theme and pre-written content. Pelican allows for the use of Jinja templates, which will come in handy when building out readiness endpoint.

Since I want the readiness check to fail if the git sync process fails, the readiness endpoint must be deployed during the git sync process. This has the side benefit of allowing me to include additional information in /sys/ready to be rendered by Pelican:

{
    "status": "OK",
    "site": {
        "url": "{{SITEURL}}",
        "rendered": "{{last_updated}}",
        "production": {%if PELICAN_PROD%}true{%else%}false{%endif +%}
    }
}

After Pelican renders the page the final output will look something like this:

{
    "status": "OK",
    "site": {
        "url": "https://www.theobjectivedad.com",
        "rendered": "2023-03-01T13:11:44Z",
        "production": true
    }
}

To configure /sys/ready to point to the rendered readiness page, I needed to make a few configuration changes to nginx.conf:

location = /sys/ready {
    # Serve rendered health information from file
    try_files /ready.json =404;
    add_header Content-Type "application/json";

    # Advise clients to never cache
    add_header Cache-Control "no-cache, no-store, must-revalidate";
    add_header Pragma "no-cache";
    expires 0;

    # Disable logging 
    log_not_found off;
    access_log off;

}

The previous configuration binds the /sys/ready location to the ready.json file rendered by Pelican and deployed by the synchronization process or returns an HTTP 404 status if ready.json is not found. Additionally, I am adding several response headers to signal to clients to not cache old copies of the response which would defeat the purpose of the health check endpoint. Finally, as a matter of personal preference, I disable logging to the health check endpoint.

Next, I needed to add another location to nginx.conf to enable the /sys/alive endpoint. While it is similar to the previous block the important difference is that it isn’t looking for an external page and instead uses the return directive to respond with a simple JSON payload. This is sufficient to verify that Nginx has started and is capable of responding to requests.

location = /sys/alive {
    # If the server is up, return a simple JSON response and HTTP 200
    return 200 "{\"status\": \"OK\"}";
    add_header Content-Type "application/json";

    # Disable logging 
    log_not_found off;
    access_log off;

    # Advise clients to never cache
    add_header Cache-Control "no-cache, no-store, must-revalidate";
    add_header Pragma "no-cache";
    expires 0;
}

The final step was to add a page rule to the Cloudflare configuration to ensure the health check page is never cached. This is important for both automated external monitoring and anyone else who wants to see a real (not cached) status page via the browser:

Internal Monitoring

Implementing Kubernetes health check probes could be an entire article in itself and requires some knowledge about how a pod behaves as well as how Kubernetes handles liveliness and readiness failures. To keep things simple let’s take a look at the important minimum / maximum times I’m anticipating for the pods hosting this site in the table below.

Condition	Probe	Formula
Pre startup, minimum time before pod is ready	`startupProbe`	\(\text{initialDelaySeconds}\)
Pre startup, time before pod is restarted due to not ready	`startupProbe`	\(\text{initialDelaySeconds} + (\text{periodSeconds} \times \text{failureThreshold})\)
Post startup, minimum time before the pod is restarted due to liveliness failure	`livenessProbe`	\(\text{periodSeconds} \times \text{failureThreshold}\)
Post startup, minimum time before traffic is redirected away from the pod	`readinessProbe`	\(\text{periodSeconds} \times \text{failureThreshold}\)

Again, there are more considerations here than I mentioned but this does represent the major points. To implement the internal probes I added the following configuration to the deployment manifest:

Startup probe:

startupProbe:
    httpGet:
        path: /sys/ready
        port: 8080
    periodSeconds: 5
    failureThreshold: 100
    successThreshold: 1

Liveliness probe:

livenessProbe:
    httpGet:
        path: /sys/alive
        port: 8080
    periodSeconds: 5
    failureThreshold: 3
    successThreshold: 1

Readiness probe:

readinessProbe:
    httpGet:
        path: /sys/ready
        port: 8080
    periodSeconds: 5
    failureThreshold: 3
    successThreshold: 1

External Monitoring

With the health check endpoint in place, I was ready to add external monitoring. This isn’t a space I have much expertise in however a quick Google search for “free website monitoring” led me to UptimeRobot which seemed good enough.

While the free tier has some restrictions such as a minimum check interval of 5 minutes is adequate for a self-hosted personal website. The setup was simple enough and even offers a publicly facing dashboard:

Although not everyone would care about it, having a decent mobile app was certainly nice. Best yet, it is available in the free tier:

Final Thoughts

This is a fairly good start and adds a useful capability to my toolbox. I can’t understate this enough because the next time I need to design or implement a health check endpoint or monitoring service I’ll have a working model to start with.

While going through this exercise I did identify a few additional enhancements that I would like to have but didn’t have time to implement including:

Add jq to the Nginx container so the Kublet can run assertions on the JSON responses coming back vs looking at the status code.
Add a sidecar that will periodically update ready.json with additional information such as an instance identifier (which can be helpful for troubleshooting)

The next milestone in the monitoring and availability space for me will be to create a secondary “data center” that I can host this site from. Being a statically-generated site it shouldn’t be that difficult but will require some additional changes to my CI/CD process. I also don’t want to get hooked into any monthly costs which means I’ll need to set up a workflow start and stop infrastructure as well as explore routing options available in the Cloudflare free tier.