Phase 1: Week 1

DevOps Career Roadmap: NCR, 2025–26



DSA: Answered Directly

Do NCR DevOps, Cloud Support, and Infrastructure roles screen for LeetCode-style DSA?

At funded startups targeting DevOps specifically: no. The technical screen is scripting (Python or Bash), Linux troubleshooting scenarios, and Docker/Compose tasks. No arrays, graphs, or dynamic programming.

At product companies and GCCs: mostly no, but with a real exception. Some GCCs (Optum, Genpact Tech, Publicis Sapient) route all applicants through a standardized first-round online assessment that includes basic coding: not algorithmic complexity, but "write a function that does X" problems a second-year CS student should handle. If you apply to these through their career portal (not via referral or recruiter), you may hit this screen. The coding is at the level of "reverse a string without slicing" or "count words in a paragraph", not LeetCode mediums. If you've written Python for two years, you're fine. No prep needed.

At service companies (TCS, Wipro, HCL, Infosys): their entry tests include aptitude + basic coding. The coding section is trivial CS101 material, not competitive programming. If this is the blocker, 2 hours of practice on basic Python problems clears it.

Conclusion: No LeetCode prep is needed. No LeetCode appears in this roadmap under any framing. The only coding you'll write as interview prep is ops-relevant Python scripting, which is on the roadmap as a skill anyway.


Technology Rationale Table

Technology | Immediate Employability Impact | Learnable on the Job? | Why It's Here
Linux (ops-specific) | High | Partially | Every VPS, EC2, and container runs Linux; ops-level commands are tested directly in technical rounds
Git (infra conventions) | Medium | Yes | Ops repos, GitOps patterns, and READMEs are how hiring managers evaluate your actual work product
Networking | High | Partially | DNS, SSH, firewall, and HTTP debugging appear explicitly in junior DevOps technical screens in NCR
Python scripting | High | Partially | Automation scripts are the most common take-home task format; real scripts separate you from CV-padders
Docker | High | No | In nearly every junior DevOps JD in NCR; multi-stage builds and Compose are now baseline expectations
Docker Compose | High | Partially | Multi-service Compose stacks are the standard deployment unit at companies without Kubernetes
Nginx | High | Partially | Reverse proxy knowledge is expected at junior level; SSL termination and upstreams come up in interviews
GitHub Actions | High | Yes | CI/CD is table stakes; VPS deploy pipelines are more transferable than Cloudflare-specific ones
AWS EC2/IAM/S3/VPC | High | Partially | Most NCR product companies and GCCs run on AWS; differentiates you from candidates who only know on-prem
CloudWatch | Medium | Yes | Logging and basic monitoring show operational awareness; AWS-native so low friction to learn
Terraform | Medium | Yes | Increasingly in mid-level JDs; rarely a hard gate at junior level in NCR, but Phase 3 depth pays off post-hire
Prometheus | Medium | Yes | Monitoring stack is a bonus at junior level; signals genuine ops thinking beyond just deployment
Grafana | Medium | Yes | Dashboard work demonstrates operational maturity; trivial to add once Prometheus is running
Kubernetes | Medium | No | Appearing in NCR JDs even at junior level now; large surface area justifies Phase 3 placement

PHASE 1: Weeks 1–4, Become Employable


Week 1: Linux (Ops-Specific) + Git (Infra Conventions)

What this week unlocks: Your VPS becomes a real work environment, not a toy. You can answer Linux troubleshooting questions in interviews. FactorSphere gets professional documentation that turns a live project into an interview story.

Study hours: 42

Learning objectives:

  • Ops-level networking stack: reading active connections, capturing traffic, diagnosing from the command line
  • SSH hardening: key-only auth, non-root user, config file discipline
  • systemd: writing, enabling, and managing service units from scratch
  • Bash scripting for automation: health checks, backups, log analysis
  • Process and resource management at ops level
  • Log analysis with journalctl and standard log tooling
  • How infrastructure repos differ from application repos; writing READMEs a hiring manager actually reads
  • Professional documentation for FactorSphere CI/CD pipeline and architecture

Technologies: Linux (Ubuntu 22.04 on Hetzner VPS), Bash, systemd, UFW, SSH, journalctl, Git


Monday–Tuesday: VPS baseline and network stack

SSH into the Hetzner VPS. If you're logging in as root, fix that first.

adduser deploy
usermod -aG sudo deploy

SSH hardening. Edit /etc/ssh/sshd_config:

PasswordAuthentication no
PermitRootLogin no
AllowUsers deploy

Copy your public key to the new user:

ssh-copy-id -i ~/.ssh/id_ed25519.pub deploy@<vps-ip>

Generate a dedicated key if you don't have one: ssh-keygen -t ed25519 -C "vps-deploy-key"

Validate the config with sshd -t, then restart sshd: systemctl restart sshd. Verify from a second terminal that you can log in as deploy before ending the root session.

UFW setup:

ufw default deny incoming
ufw default allow outgoing
ufw allow from <your-home-ip> to any port 22
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
ufw status verbose

Networking stack commands (run these and understand every line of output):

ip addr show
ip route show
ss -tlnp          # listening TCP sockets with process names
ss -tulnp         # TCP + UDP
netstat -tlnp     # older but still appears in interviews

Install tcpdump: apt install tcpdump. Run:

tcpdump -i eth0 port 22   # watch your own SSH session
tcpdump -i eth0 port 80 -A  # see HTTP traffic in ASCII

You're not becoming a packet analysis expert. You're learning to answer "how would you debug a connectivity issue" with real commands.
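One way to make the socket listing scriptable is to parse it. The sketch below assumes the column layout that ss -tlnp prints on Ubuntu 22.04 (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, Process). The function name parse_listeners and the sample text are illustrative, not captured from a real host.

```python
import re

# Sample text shaped like `ss -tlnp` output (illustrative data, not from a real host)
SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*          users:(("sshd",pid=812,fd=3))
LISTEN  0       5       0.0.0.0:8080        0.0.0.0:*          users:(("python3",pid=1450,fd=3))
LISTEN  0       128     [::]:22             [::]:*             users:(("sshd",pid=812,fd=4))
"""

def parse_listeners(text):
    """Return a list of (address, port, process_name) for each LISTEN row."""
    rows = []
    for line in text.splitlines()[1:]:            # skip the header row
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        # rpartition splits on the last colon, so IPv6 forms like [::]:22 still work
        address, _, port = fields[3].rpartition(":")
        # The process column only appears with -p and sufficient privileges
        match = re.search(r'\(\("([^"]+)"', line)
        rows.append((address, int(port), match.group(1) if match else None))
    return rows

for address, port, proc in parse_listeners(SAMPLE):
    print(f"{proc or '?':8} listening on {address}:{port}")
```

On the VPS, the same function could be fed the output of ss -tlnp captured with subprocess; anything listening on 0.0.0.0 that UFW does not explicitly allow is worth a second look.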


Wednesday: systemd

Write a real systemd service. Create a minimal Python HTTP server first:

# /opt/healthapi/server.py
import http.server, json
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())
    def log_message(self, *args): pass

http.server.HTTPServer(('', 8080), Handler).serve_forever()

Service unit /etc/systemd/system/healthapi.service:

[Unit]
Description=Health API
After=network.target

[Service]
Type=simple
User=deploy
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Reload systemd, then enable, start, and inspect the service:

systemctl daemon-reload
systemctl enable healthapi
systemctl start healthapi
systemctl status healthapi
curl http://localhost:8080
journalctl -u healthapi -f       # tail logs
journalctl -u healthapi --since "1 hour ago"
journalctl -p err -u healthapi   # errors only

Stop the service, break it intentionally (wrong path), observe systemctl status output. This is what debugging looks like.
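Before breaking a unit on purpose, it can help to see which parts are structurally required. The following is a rough sketch that reads a unit file with Python's configparser and reports missing sections or keys. The REQUIRED table and the lint_unit name are choices made for this example, not systemd rules, and configparser only approximates unit-file syntax.

```python
import configparser

UNIT_TEXT = """\
[Unit]
Description=Health API
After=network.target

[Service]
Type=simple
User=deploy
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""

# Keys this sketch insists on; a convention chosen here, not something systemd enforces
REQUIRED = {"Unit": ["Description"], "Service": ["ExecStart"], "Install": ["WantedBy"]}

def lint_unit(text):
    """Return a list of human-readable problems found in a unit file's text."""
    # strict=False tolerates repeated keys, which unit files allow
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str              # keep key case; systemd keys are case-sensitive
    parser.read_string(text)
    problems = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            problems.append(f"missing [{section}] section")
            continue
        for key in keys:
            if not parser.has_option(section, key):
                problems.append(f"[{section}] is missing {key}=")
    return problems

print(lint_unit(UNIT_TEXT) or "unit file looks structurally sound")
```

For the authoritative check on the VPS itself, systemd-analyze verify /etc/systemd/system/healthapi.service does the real parsing.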


Thursday: Bash scripting for automation

Write three scripts. These go on GitHub. They are real deliverables, not exercises.

health-check.sh (VPS health report):

#!/bin/bash
set -euo pipefail

THRESHOLD_DISK=80
THRESHOLD_MEM=90

echo "=== VPS Health Check $(date) ==="

# Disk
DISK_USE=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
echo "Disk usage: ${DISK_USE}%"
[ "$DISK_USE" -gt "$THRESHOLD_DISK" ] && echo "WARNING: disk above ${THRESHOLD_DISK}%" >&2

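# Memory
MEM_TOTAL=$(free -m | awk '/Mem:/ {print $2}')
MEM_USED=$(free -m | awk '/Mem:/ {print $3}')
MEM_PCT=$(( MEM_USED * 100 / MEM_TOTAL ))
echo "Memory usage: ${MEM_PCT}% (${MEM_USED}/${MEM_TOTAL} MB)"
if [ "$MEM_PCT" -gt "$THRESHOLD_MEM" ]; then
    echo "WARNING: memory above ${THRESHOLD_MEM}%" >&2
fi

# Services: edit this list for your environment
for SVC in healthapi nginx docker; do
    # is-active exits non-zero for inactive or unknown units but still prints the state;
    # "|| true" keeps set -e from aborting without appending extra text to STATUS
    STATUS=$(systemctl is-active "$SVC" 2>/dev/null || true)
    echo "Service $SVC: ${STATUS:-unknown}"
done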

# Open ports
echo "Listening ports:"
ss -tlnp | awk 'NR>1 {print $1, $4, $6}'

backup.sh (timestamped archive):

#!/bin/bash
set -euo pipefail

SRC="${1:?Usage: backup.sh <source-dir>}"
DEST="/var/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
ARCHIVE="${DEST}/backup_${TIMESTAMP}.tar.gz"

mkdir -p "$DEST"
tar czf "$ARCHIVE" "$SRC"
echo "Backup created: $ARCHIVE ($(du -sh "$ARCHIVE" | cut -f1))"

# Keep last 7 backups
ls -t "${DEST}"/backup_*.tar.gz | tail -n +8 | xargs -r rm
echo "Old backups pruned. Current count: $(ls "${DEST}"/backup_*.tar.gz | wc -l)"

log-analyzer.sh (parses journalctl for errors in the last N hours):

#!/bin/bash
HOURS="${1:-24}"
echo "=== Error summary: last ${HOURS} hours ==="
journalctl --since "${HOURS} hours ago" -p err --no-pager | \
    awk '{print $5}' | sort | uniq -c | sort -rn | head -20

Make them executable (chmod +x), test them, commit them with a meaningful message: feat(scripts): add VPS health check, backup, and log analysis scripts.
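The "keep last 7" step in backup.sh is the part most worth being able to explain. Here is a minimal sketch of the same retention rule in Python. One difference to note: backup.sh orders by modification time (ls -t), while this sketch orders by the timestamp embedded in the filename, which is equivalent only if files are never touched after creation. The function name is hypothetical.

```python
def backups_to_prune(filenames, keep=7):
    """Given backup archive names, return the ones older than the newest `keep`."""
    # Names embed YYYYmmdd_HHMMSS, so lexicographic order equals chronological order
    newest_first = sorted(filenames, reverse=True)
    return newest_first[keep:]

# Ten nightly archives, 1 Jan through 10 Jan (illustrative names)
names = [f"backup_202501{day:02d}_020000.tar.gz" for day in range(1, 11)]
print(backups_to_prune(names))
```

With ten archives and keep=7, the three oldest are returned; with seven or fewer, nothing is pruned.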


Friday–Saturday: Process management, disk, LVM awareness, log deep dive

Process management:

ps aux --sort=-%cpu | head -10    # top CPU consumers
ps aux --sort=-%mem | head -10    # top memory consumers
kill -9 <pid>                     # SIGKILL
kill -15 <pid>                    # SIGTERM (graceful)
nice -n 10 <command>              # lower priority
renice -n 5 -p <pid>              # change running process priority

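top and htop: know what load average means. A load average of 1.0 on a single-core machine means the CPU is fully utilized; on a 4-core machine, 1.0 is roughly 25% of capacity. On Linux the figure also counts tasks blocked in uninterruptible I/O, so a high load with idle CPUs points at disk or network storage. This comes up in interviews.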
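The per-core arithmetic can be checked directly. This small sketch normalizes a load average by core count; load_per_core is a name invented for the example, and os.getloadavg() is only available on Unix-like systems.

```python
import os

def load_per_core(load_avg, cores):
    """Normalize a load average by core count; 1.0 means every core is busy on average."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores

print(load_per_core(1.0, 1))   # single core, load 1.0
print(load_per_core(1.0, 4))   # four cores, load 1.0

try:
    one_min, _, _ = os.getloadavg()          # the same 1/5/15-minute figures uptime shows
    cores = os.cpu_count() or 1
    print(f"this host: {one_min:.2f} on {cores} cores = {load_per_core(one_min, cores):.0%}")
except OSError:
    print("load average not available on this platform")
```

A sustained per-core value above 1.0 means work is queuing; whether that is CPU or I/O wait is what top's %wa column helps distinguish.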

Disk:

df -h                  # filesystem usage
du -sh /var/*          # directory sizes
lsblk                  # block devices
fdisk -l               # partition table

LVM: you may not have LVM on the VPS, but understand the commands:

pvs    # physical volumes
vgs    # volume groups
lvs    # logical volumes

Log analysis:

journalctl --since "2025-01-01" --until "2025-01-02"
journalctl -u nginx --no-pager | grep "502" | wc -l
grep -E "ERROR|WARN" /var/log/syslog | tail -50
zcat /var/log/syslog.2.gz | grep ERROR   # compressed log files

Sunday: Git infra conventions + FactorSphere documentation

This is the most interview-impactful work of the week.

Create docs/ in the FactorSphere repo. Write two files:

ARCHITECTURE.md should cover: why Cloudflare Workers instead of a traditional server (latency, no cold starts at edge, cost at zero users), how requests flow (DNS → Cloudflare edge → Worker → Pinecone/LLM → response), data flow for the ranking pipeline (source aggregation → processing Workers → Pinecone indexing → query Workers), why Pinecone over a hosted Postgres vector extension (managed, no infra to maintain), how the frontend on Cloudflare Pages is decoupled from the Worker API, what the tradeoffs are (no persistent state in Workers, Pinecone vendor lock-in, cold start behavior). Use a Mermaid diagram:

graph LR
  User --> CF_Edge[Cloudflare Edge]
  CF_Edge --> Worker_API[Workers API]
  Worker_API --> Pinecone[(Pinecone Vector DB)]
  Worker_API --> LLM[LLM Inference]
  CF_Pages --> Worker_API

CICD.md should cover: what triggers a deploy (push to main), the GitHub Actions workflow steps (lint → type check → Wrangler deploy), what wrangler deploy actually does (bundles the Worker, uploads to Cloudflare's edge network), how Cloudflare Pages handles frontend deploys automatically (build hook on push), what happens on failure (GitHub Actions marks the workflow run as failed and Wrangler does not promote the broken version, so the previous version stays live), how to roll back (revert commit + push, or wrangler rollback to a previous deployment ID), what environment variables are injected at deploy time vs stored as Cloudflare secrets.

Write this so you can recite it verbally in a 3-minute interview answer. That's the test.

Infra repo conventions:

  • READMEs in infrastructure repos answer: what this runs, how to run it, what environment variables it needs, what the architecture looks like, and what decisions were made and why
  • Application repos explain features; infrastructure repos explain operations
  • Commit messages in infra repos are imperative and specific: "fix(nginx): increase worker_connections to handle spike load", not "update config"
  • Store .env.example with all variable names and no values. Never commit .env.
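The .env.example convention is easy to automate. As a sketch, the function below strips values from .env content while keeping variable names, comments, and blank lines. The function name and the sample values are made up for illustration.

```python
def to_env_example(env_text):
    """Strip values from .env content, keeping variable names, comments, and blank lines."""
    out = []
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            out.append(line)                  # comments and blank lines pass through unchanged
            continue
        name = stripped.split("=", 1)[0].strip()
        out.append(f"{name}=")                # keep the name, drop the value
    return "\n".join(out) + "\n"

# Illustrative content; the values are fake
sample = "# Cloudflare\nCF_API_TOKEN=abc123\nCF_ACCOUNT_ID = 42\n\nDEBUG=true\n"
print(to_env_example(sample), end="")
```

Pair this with a .gitignore entry for .env so the real file cannot be committed by accident.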

Deliverables, Week 1:

  • Hetzner VPS: non-root deploy user, SSH key-only auth, UFW configured, healthapi systemd service running
  • GitHub repo ops-scripts: health-check.sh, backup.sh, log-analyzer.sh with meaningful commits and a README
  • FactorSphere repo: docs/ARCHITECTURE.md and docs/CICD.md β€” professional, interview-ready

Week 2: Networking (Practical)

What this week unlocks: You can answer the networking questions that come up in a junior DevOps technical round. DNS debugging, HTTP troubleshooting, SSH advanced usage, and firewall diagnosis are the most common technical screen topics. This week makes you competent at all of them.

Study hours: 42

Learning objectives:

  • DNS: resolution chain, record types, TTL, cache behavior, practical debugging tools
  • HTTP: headers, status codes, TLS handshake, what curl reveals
  • TCP/IP: three-way handshake, port states, socket inspection
  • SSH: config file, tunneling, port forwarding, agent forwarding
  • Firewalls: UFW rule management, iptables literacy, nftables awareness
  • tcpdump for real diagnosis

Monday–Tuesday: DNS

Install: apt install dnsutils on the VPS if not present.

dig factorsphere.org                    # full answer section
dig +short factorsphere.org             # just the IP
dig +trace factorsphere.org             # full resolution chain from root
dig @8.8.8.8 factorsphere.org           # force a specific resolver
dig @1.1.1.1 factorsphere.org           # Cloudflare resolver
dig MX factorsphere.org                 # mail records
dig TXT factorsphere.org                # SPF, DKIM, verification records
dig CNAME www.factorsphere.org          # canonical name
dig -x <ip-address>                     # reverse lookup

Read the AUTHORITY SECTION and ADDITIONAL SECTION in dig output. Understand what TTL means: when you change a DNS record, resolvers that already cached the old value keep serving it until its TTL expires, so traffic shifts gradually rather than all at once. This is how to answer "how long does DNS propagation take?"
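To make the TTL reasoning concrete, here is a small sketch that splits one ANSWER SECTION line from dig into its fields and computes the latest moment a cached copy of the old record could still be served. The hostname, IP (a documentation-only address), and function names are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

def parse_answer(line):
    """Split one ANSWER SECTION line from dig into its five fields."""
    name, ttl, rr_class, rr_type, value = line.split(None, 4)
    return {"name": name, "ttl": int(ttl), "class": rr_class, "type": rr_type, "value": value}

def latest_stale_time(changed_at, old_ttl_seconds):
    """Worst case: a resolver cached the old record just before the change."""
    return changed_at + timedelta(seconds=old_ttl_seconds)

# Illustrative line in dig's answer format: name, TTL, class, type, value
record = parse_answer("vps.example.com.\t300\tIN\tA\t203.0.113.10")
changed = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(record)
print("old answer may be served until", latest_stale_time(changed, record["ttl"]).isoformat())
```

This is why lowering the TTL well ahead of a planned migration shortens the window in which clients still reach the old IP.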

Watch DNS queries in real time:

tcpdump -i any port 53 -n
# Open another terminal and run: dig google.com
# Watch the query and response packets appear

Understand /etc/resolv.conf (which nameserver your system queries) and /etc/hosts (local override, checked before DNS). Add an entry to /etc/hosts that maps a fake hostname to localhost, verify it resolves, then remove it.

Configure a real subdomain if you have a domain: point vps.yourdomain.com to your Hetzner IP as an A record. Verify with dig. This demonstrates you've actually managed DNS, not just read about it.


Wednesday: HTTP in depth

curl -v https://factorsphere.org            # verbose: see TLS handshake, headers, body
curl -I https://factorsphere.org            # HEAD request only, no body
curl -L https://factorsphere.org            # follow redirects
curl -X POST https://api.example.com/endpoint \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"key": "value"}'
curl -o /dev/null -s -w "%{http_code}\n" https://factorsphere.org  # just status code

Read every header in the curl -v output. Know what these mean:

  • Cache-Control: max-age=86400 - the response may be cached and reused for 24 hours
  • X-Forwarded-For - original client IP when behind a proxy
  • Strict-Transport-Security - tells the browser to use HTTPS only for future requests to this host
  • CF-RAY - Cloudflare request ID, useful for debugging Workers
  • Content-Encoding: gzip - response body is compressed

HTTP status codes you must know cold: 200, 201, 204, 301, 302, 304, 400, 401, 403, 404, 422, 429, 500, 502, 503, 504. Know the difference between 401 and 403, between 502 and 503.
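For the pairs that get confused most often, the standard library's status table is a quick reference. The sketch below uses http.HTTPStatus; the describe helper is a name chosen for this example. In short: 401 means credentials are missing or invalid, 403 means the caller is identified but not permitted, 502 means a proxy received an invalid response from the upstream, and 503 means the server cannot handle the request right now.

```python
from http import HTTPStatus

def describe(code):
    """Return 'code phrase (class)' using the standard library's status table."""
    status = HTTPStatus(code)
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return f"{status.value} {status.phrase} ({classes.get(code // 100, 'other')})"

for code in (401, 403, 502, 503):
    print(describe(code))
```

When Nginx sits in front of an app, 502 usually points at the app process being down or crashing, while 503 more often points at overload or a maintenance state.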

TLS handshake: be able to describe client hello → server hello + certificate → client verifies cert → key exchange → symmetric session established. Not cryptography depth, but the sequence.

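openssl s_client -connect factorsphere.org:443 -servername factorsphere.org   # inspect the TLS certificate chain
# Check cert expiry straight from the live endpoint (no saved cert.pem needed)
echo | openssl s_client -connect factorsphere.org:443 -servername factorsphere.org 2>/dev/null | openssl x509 -noout -dates
openssl x509 -in cert.pem -noout -dates          # same check against a certificate file on disk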
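The notAfter date that openssl prints is the same string Python's ssl module returns from getpeercert(). As a sketch of turning it into "days remaining", the function below uses ssl.cert_time_to_seconds; the function name and the fixed dates are illustrative so the result is reproducible without a network call.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Days between `now` and a certificate's notAfter string (format used by getpeercert())."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

# Fixed inputs for a reproducible demo; the certificate date is made up
fixed_now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jan 31 00:00:00 2025 GMT", now=fixed_now))
print(days_until_expiry("Dec 31 00:00:00 2024 GMT", now=fixed_now))
```

A negative result means the certificate has already expired; a monitoring script would typically warn well before that, for example below 30 days.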

Thursday: SSH advanced

Create ~/.ssh/config:

Host vps
    HostName <your-vps-ip>
    User deploy
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60

Host github.com
    IdentityFile ~/.ssh/id_ed25519_github
    User git

Now ssh vps instead of ssh -i ~/.ssh/id_ed25519 deploy@<ip>.

Local port forwarding (access a service on the VPS that isn't exposed publicly):

ssh -L 5432:localhost:5432 vps
# Now psql -h localhost -p 5432 connects to Postgres on the VPS

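Remote port forwarding (expose a local service through the VPS, useful for demos):

ssh -R 8080:localhost:3000 vps
# Connections to port 8080 on the VPS now forward to your local machine's port 3000.
# By default sshd binds the forwarded port to the VPS loopback only; reaching it from
# outside requires GatewayPorts in sshd_config (and a matching UFW rule).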

ProxyJump (hop through a bastion):

ssh -J bastion.example.com internal-server.example.com
# Or in config:
# Host internal
#     ProxyJump bastion

SSH agent:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
ssh-add -l    # list loaded keys

Friday–Saturday: Firewalls and network diagnostics

UFW advanced:

ufw status numbered                     # numbered rules for deletion
ufw delete 3                            # delete rule 3
ufw allow from 10.0.0.0/8 to any port 5432  # Postgres from internal network only
ufw logging on
tail -f /var/log/ufw.log

iptables (you need to read it, not memorize it):

iptables -L -n -v           # list all rules with packet/byte counts
iptables -L INPUT -n -v     # just INPUT chain
iptables -t nat -L -n       # NAT table

UFW uses iptables underneath. When UFW allows port 80, it adds an iptables ACCEPT rule. This is how to answer "how does UFW work under the hood?"

nftables:

nft list ruleset    # view current rules

Network diagnostics:

ping -c 4 8.8.8.8                    # basic reachability
traceroute 8.8.8.8                   # hop-by-hop path
mtr 8.8.8.8                         # traceroute + ping combined, live
ss -s                                # socket statistics summary
ss -tnp state established            # established TCP connections

Sunday: Build the networking diagnostics script

This goes on GitHub as a real project.

endpoint-monitor.py (checks a list of endpoints and reports health):

#!/usr/bin/env python3
"""
Endpoint health monitor: checks DNS, HTTP reachability, and SSL cert expiry.
Usage: python3 endpoint-monitor.py --config endpoints.yaml [--json]
"""
import argparse, json, socket, ssl, datetime, sys, time
import urllib.request, urllib.error, urllib.parse
import yaml  # third-party: pip install PyYAML

def check_dns(hostname):
    """Resolve hostname to an IPv4 address."""
    try:
        ip = socket.gethostbyname(hostname)
        return {"status": "ok", "ip": ip}
    except socket.gaierror as e:
        return {"status": "error", "error": str(e)}

def check_http(url, timeout=10):
    """GET the URL and report the status code and latency in milliseconds."""
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "endpoint-monitor/1.0"})
        start = time.monotonic()
        with urllib.request.urlopen(req, timeout=timeout) as r:
            latency_ms = round((time.monotonic() - start) * 1000, 1)
            return {"status": "ok", "http_code": r.status, "latency_ms": latency_ms}
    except urllib.error.HTTPError as e:
        return {"status": "error", "http_code": e.code}
    except Exception as e:
        return {"status": "error", "error": str(e)}

def check_ssl(hostname, port=443):
    try:
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(socket.create_connection((hostname, port), timeout=10),
                             server_hostname=hostname) as s:
            cert = s.getpeercert()
            expiry_str = cert['notAfter']
            expiry = datetime.datetime.strptime(expiry_str, "%b %d %H:%M:%S %Y %Z")
            days_left = (expiry - datetime.datetime.utcnow()).days
            return {"status": "ok", "expires": expiry_str, "days_remaining": days_left}
    except Exception as e:
        return {"status": "error", "error": str(e)}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args()

    with open(args.config) as f:
        config = yaml.safe_load(f)

    results = {}
    for endpoint in config.get("endpoints", []):
        name = endpoint["name"]
        url = endpoint["url"]
        hostname = urllib.parse.urlsplit(url).hostname  # drops scheme, port, and path
        results[name] = {
            "dns": check_dns(hostname),
            "http": check_http(url),
            "ssl": check_ssl(hostname) if url.startswith("https") else None,
        }

    if args.json:
        print(json.dumps(results, indent=2))
    else:
        for name, checks in results.items():
            print(f"\n{name}:")
            for check_name, result in checks.items():
                if result:
                    status = "✓" if result["status"] == "ok" else "✗"
                    print(f"  {status} {check_name}: {result}")

    any_failure = any(
        c["status"] == "error"
        for r in results.values()
        for c in r.values() if c
    )
    sys.exit(1 if any_failure else 0)

if __name__ == "__main__":
    main()

endpoints.yaml:

endpoints:
  - name: FactorSphere
    url: https://factorsphere.org
  - name: FactorSphere API
    url: https://api.factorsphere.org
  - name: VPS
    url: http://<your-vps-ip>:8080

Deliverables, Week 2:

  • ops-scripts repo updated: endpoint-monitor.py added with endpoints.yaml.example and updated README
  • Can verbally answer in an interview: "Walk me through what happens when a user types factorsphere.org and hits Enter", from DNS query through TLS through Cloudflare edge to Worker response
  • VPS subdomain configured (if you have a domain) and verified with dig

Week 3: Python DevOps Scripting

What this week unlocks: Python automation is the most common take-home task format in NCR DevOps interviews. By the end of this week you have a GitHub repo with real scripts and can complete a take-home assignment in 2 hours rather than 4.

Study hours: 42

Learning objectives:

  • subprocess: running shell commands from Python, capturing output, handling return codes
  • os/sys: environment variables, path operations, argument handling
  • argparse: building proper CLI tools with flags and help text
  • requests: HTTP calls, error handling, timeouts, retries
  • yaml/json: config parsing, output generation
  • Writing scripts that do real infrastructure work: deploy checks, API monitors, config validators, log parsers

Monday–Tuesday: subprocess + os + sys + argparse

subprocess, the right way:

import subprocess

# Run a command, capture output, check return code
result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    capture_output=True,
    text=True,
    timeout=10
)
print(result.stdout.strip())   # "active" or "inactive"
print(result.returncode)       # 0 = active, 3 = inactive

# Run shell command (avoid when possible: harder to escape safely)
result = subprocess.run(
    "df -h | grep '/$'",
    shell=True, capture_output=True, text=True
)

# Raise on non-zero exit
subprocess.run(["docker", "ps"], check=True)  # raises CalledProcessError if docker fails
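In ops scripts, the failure paths matter as much as the happy path. As a sketch, the wrapper below turns a timeout and a missing binary into exit codes instead of tracebacks. The run name and the choice of 124 and 127 are assumptions for this example; they mirror the codes that the coreutils timeout command and the shell use for the same situations.

```python
import subprocess
import sys

def run(cmd, timeout=10):
    """Run a command list and return (returncode, stdout), mapping common failures to codes."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result.returncode, result.stdout.strip()
    except subprocess.TimeoutExpired:
        return 124, f"timed out after {timeout}s"      # same code coreutils `timeout` uses
    except FileNotFoundError:
        return 127, f"command not found: {cmd[0]}"     # same code a shell uses for "not found"

# sys.executable keeps the demo portable; on the VPS this could be
# ["systemctl", "is-active", "nginx"] instead
print(run([sys.executable, "-c", "print('ok')"]))
print(run([sys.executable, "-c", "import sys; sys.exit(3)"]))
print(run(["definitely-not-a-real-binary-xyz"]))
```

Returning a tuple keeps the caller in control of what counts as failure, which matters for commands like systemctl is-active where a non-zero code is an expected answer.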

os and sys:

import os, sys

# Environment variables
api_key = os.environ.get("CF_API_KEY")  # returns None if not set, no KeyError
api_key = os.environ["CF_API_KEY"]      # raises KeyError if not set; use when required

# Paths
os.path.join("/var/log", "nginx", "access.log")
os.path.exists("/etc/nginx/nginx.conf")
os.path.abspath("../config")

# Script directory (useful for loading config files relative to script)
script_dir = os.path.dirname(os.path.abspath(__file__))
config_path = os.path.join(script_dir, "config.yaml")

# Exit with status code (important for shell scripts that call your Python)
sys.exit(0)   # success
sys.exit(1)   # failure

argparse: build a real CLI:

import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Check service health on a VPS",
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("--host", required=True, help="VPS hostname or IP")
    parser.add_argument("--port", type=int, default=8080, help="Health endpoint port")
    parser.add_argument("--service", action="append", dest="services",
                        help="systemd service to check (repeat for multiple)")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--output-format", choices=["text", "json"], default="text")
    args = parser.parse_args()
    # args.host, args.port, args.services, args.verbose, args.output_format
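A parser like the one above can be exercised without a shell, because parse_args accepts an explicit argument list. The sketch below is a trimmed variant: build_parser is a name introduced here, and default=[] on --service is an addition so that code iterating over args.services does not have to handle None when the flag is omitted.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Check service health on a VPS")
    parser.add_argument("--host", required=True, help="VPS hostname or IP")
    parser.add_argument("--port", type=int, default=8080, help="Health endpoint port")
    # action="append" collects repeated flags into a list
    parser.add_argument("--service", action="append", dest="services", default=[],
                        help="systemd service to check (repeat for multiple)")
    parser.add_argument("--output-format", choices=["text", "json"], default="text")
    return parser

# Passing a list instead of reading sys.argv makes the CLI easy to test
args = build_parser().parse_args(
    ["--host", "vps", "--service", "nginx", "--service", "docker", "--output-format", "json"]
)
print(args.host, args.port, args.services, args.output_format)
```

Note that --output-format becomes args.output_format: argparse converts hyphens in long option names to underscores.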

Build deploy-check.py: it takes a service name as argument, checks that it's active, verifies the port is listening, hits the health endpoint, and prints pass/fail with a matching exit code:

#!/usr/bin/env python3
"""
Post-deploy smoke test: checks systemd service, port, and HTTP health endpoint.
Usage: python3 deploy-check.py --service nginx --port 80 --endpoint /health
"""
import argparse, subprocess, socket, sys
import requests

def check_service(name):
    r = subprocess.run(["systemctl", "is-active", name],
                       capture_output=True, text=True)
    return r.stdout.strip() == "active"

def check_port(port, host="localhost"):
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except (socket.timeout, ConnectionRefusedError, OSError):
        return False

def check_http(url, timeout=10):
    try:
        r = requests.get(url, timeout=timeout)
        return r.status_code < 500, r.status_code
    except requests.exceptions.RequestException as e:
        return False, str(e)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--service", required=True)
    parser.add_argument("--port", type=int, required=True)
    parser.add_argument("--endpoint", default="/health")
    parser.add_argument("--host", default="localhost")
    args = parser.parse_args()

    checks = [
        ("service active", check_service(args.service)),
        ("port listening", check_port(args.port, args.host)),
    ]
    http_ok, http_detail = check_http(f"http://{args.host}:{args.port}{args.endpoint}")
    checks.append((f"HTTP {args.endpoint} ({http_detail})", http_ok))  # show status code or error

    passed = all(ok for _, ok in checks)
    for name, ok in checks:
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {name}")

    sys.exit(0 if passed else 1)

if __name__ == "__main__":
    main()

Wednesday: requests + API interaction

import os
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Basic
r = requests.get("https://api.example.com/status", timeout=10)
r.raise_for_status()   # raises HTTPError for 4xx/5xx
data = r.json()

# With headers
r = requests.get(
    "https://api.cloudflare.com/client/v4/accounts",
    headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
    timeout=10
)

# Retry logic
session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
r = session.get("https://api.example.com", timeout=10)

Build cloudflare-deploy-monitor.py: it checks the status of the most recent FactorSphere Workers deployment via the Cloudflare API. Reads CF_API_TOKEN and CF_ACCOUNT_ID from environment variables. Outputs the deployment status and timestamp. Returns exit code 1 if the last deployment failed. This is a real script that interacts with a real production system.

Cloudflare API endpoint to use: GET /accounts/{account_id}/workers/scripts/{script_name}/deployments (check current CF API docs for the exact path; the concept is what matters here, and the implementation requires your actual credentials).


Thursday: YAML/JSON parsing

import yaml, json

# YAML
with open("config.yaml") as f:
    config = yaml.safe_load(f)   # safe_load, not load: avoids arbitrary code execution

# JSON
with open("output.json") as f:
    data = json.load(f)

# Pretty print
print(json.dumps(data, indent=2, default=str))   # default=str handles datetime objects

# Write YAML
with open("generated-config.yaml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)

Build config-validator.py: it reads a YAML deployment config, validates that required keys exist and have the right types, prints errors and exits non-zero if invalid, and exits 0 if valid:

REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}

This is a real pattern: CI pipelines often validate config before proceeding.
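One possible shape for the validation core, as a sketch: it works on an already-parsed dict (what yaml.safe_load would return), so the logic can be tested without files. The validate name and the sample configs are illustrative. The explicit bool check is there because bool is a subclass of int in Python, so port: true would otherwise pass.

```python
REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}

def validate(config):
    """Return a list of error strings; an empty list means the config is valid."""
    if not isinstance(config, dict):
        return ["top-level config must be a mapping"]
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int, so reject it explicitly for int fields
        wrong_type = not isinstance(value, expected) or (expected is int and isinstance(value, bool))
        if wrong_type:
            errors.append(f"{key}: expected {expected.__name__}, got {type(value).__name__}")
    return errors

# Illustrative configs, as they would look after yaml.safe_load
good = {"service_name": "healthapi", "image": "healthapi:1.0", "port": 8080,
        "health_endpoint": "/health", "environment": {"LOG_LEVEL": "info"}}
bad = {"service_name": "healthapi", "port": "8080", "environment": {}}
print(validate(good))
print(validate(bad))
```

The script's entry point would then print each error and call sys.exit(1 if errors else 0), which is what lets a CI step fail the pipeline.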


Friday–Saturday: Full ops script

Build vps-health-reporter.py, the Week 3 anchor deliverable. It's a single script that does everything:

Usage: python3 vps-health-reporter.py [--json] [--verbose] [--output FILE]

Checks:
  - Disk usage per filesystem (warns if > configurable threshold)
  - Memory usage
  - systemd services (configured list)
  - Port reachability (configured list)
  - HTTP endpoints (configured list with expected status codes)
  - SSL cert expiry for HTTPS endpoints (warns if < 30 days)

Output:
  - Text table by default
  - JSON with --json flag
  - Writes to file with --output flag
  - Exit code 0 if all checks pass, 1 if any fail

Config: reads from vps-health-reporter.yaml:

services:
  - nginx
  - healthapi
  - docker
ports:
  - host: localhost
    port: 80
    name: nginx-http
  - host: localhost
    port: 8080
    name: healthapi
endpoints:
  - url: http://localhost:8080
    expected_status: 200
    name: healthapi-root
  - url: https://factorsphere.org
    expected_status: 200
    name: factorsphere
disk_threshold_pct: 80
memory_threshold_pct: 90
ssl_warning_days: 30

This script uses subprocess, requests, ssl, socket, yaml, json, argparse, sys, os. It solves a real problem: copy it to any VPS and get a health report.
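As a starting point for the reporter, here is a sketch of one check plus the exit-code contract. It uses shutil.disk_usage rather than parsing df output; the function names and the result-dict shape are assumptions made for this example. The percentage can differ slightly from df's Use% column, which accounts for blocks reserved for root.

```python
import shutil

def check_disk(path="/", threshold_pct=80):
    """Return a result dict for one filesystem; status is 'fail' above the threshold."""
    usage = shutil.disk_usage(path)                  # named tuple: total, used, free (bytes)
    used_pct = round(usage.used / usage.total * 100, 1)
    return {"check": f"disk {path}", "used_pct": used_pct,
            "status": "ok" if used_pct <= threshold_pct else "fail"}

def exit_code(results):
    """0 if every check passed, 1 otherwise."""
    return 0 if all(r["status"] == "ok" for r in results) else 1

results = [check_disk("/", threshold_pct=80)]
for r in results:
    print(f"[{r['status'].upper():4}] {r['check']}: {r['used_pct']}% used")
print("exit code:", exit_code(results))
```

If every check returns a dict with the same status key, the text table, the --json output, and the exit code can all be derived from one results list.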


Sunday: Polish and documentation

requirements.txt:

requests==2.31.0
PyYAML==6.0.1

Meaningful commit history: if all your commits are "add scripts", you're doing it wrong. Examples of correct commit messages:

feat(deploy-check): add HTTP health endpoint validation
fix(endpoint-monitor): handle SSL cert expiry for non-HTTPS endpoints gracefully
feat(vps-reporter): add configurable disk/memory thresholds from YAML config
refactor(cloudflare-monitor): extract API client to reusable class

README for the repo: what problem each script solves, prerequisites, installation (pip install -r requirements.txt), example usage for each script, example output.

Deliverables, Week 3:

  • ops-scripts repo: 8 scripts (health-check.sh, backup.sh, log-analyzer.sh, endpoint-monitor.py, deploy-check.py, cloudflare-deploy-monitor.py, config-validator.py, vps-health-reporter.py), requirements.txt, endpoints.yaml.example, vps-health-reporter.yaml.example, clean README, 20+ meaningful commits

Week 4: Docker (Production) + Docker Compose + Application Strategy

What this week unlocks: Docker and Compose are tested in almost every junior DevOps interview in NCR. By the end of this week you have a real multi-service stack running on the VPS β€” something you can show and explain in a technical round. Sunday is the application strategy session.

Study hours: 42

Learning objectives:

  • Multi-stage Dockerfiles: how and why, not just what
  • Image optimization: layer caching order, .dockerignore, minimal base images
  • Container networking: how containers resolve each other by name in Compose
  • Volume management: named volumes vs bind mounts, when each is appropriate
  • Health checks: HEALTHCHECK instruction, container health states, depends_on conditions
  • Running Docker Compose stacks as persistent services on VPS
  • Docker Compose: full multi-service stack, .env files, override files, Makefile

Monday–Tuesday: Production Dockerfile patterns

You've used Docker. These are the patterns that distinguish junior from intern-level usage.

Multi-stage build for a Python app:

# syntax=docker/dockerfile:1

# ── Stage 1: builder ──────────────────────────────────────────
FROM python:3.11-slim AS builder

WORKDIR /build

# Copy only requirements first so Docker can cache this layer
# If requirements.txt doesn't change, this layer is reused on rebuild
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# ── Stage 2: runtime ──────────────────────────────────────────
FROM python:3.11-slim AS runtime

# Create non-root user
RUN useradd --create-home --shell /bin/bash appuser

WORKDIR /app

# Copy only the installed packages from builder
COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .

USER appuser

# PATH must include user-installed packages
ENV PATH=/home/appuser/.local/bin:$PATH

EXPOSE 5000

HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1

CMD ["python", "-m", "gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "app:app"]

.dockerignore:

.git
.gitignore
__pycache__
*.pyc
*.pyo
.pytest_cache
.env
*.env
node_modules
.venv
dist
build
*.egg-info
README.md
docs/

Why layer caching order matters: COPY requirements.txt . + RUN pip install before COPY . . means requirements are cached as long as requirements.txt doesn't change. If you COPY . . first, every file change invalidates the pip install layer. Run docker build twice; the second run should show CACHED for the pip layer.
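To see why the order matters, here is a minimal Python sketch of the idea, not Docker's actual cache implementation. It assumes each layer's cache key depends on the previous layer's key, the instruction text, and the content of any copied files. The names layer_keys, cached_layers, GOOD_ORDER, and BAD_ORDER are made up for this illustration.

```python
import hashlib

def layer_keys(steps, files):
    """Return one cache key per build step.

    Each key hashes the parent key, the instruction text, and (for COPY steps)
    the content of the copied files, so a change invalidates that layer and
    every layer after it.
    """
    keys, parent = [], ""
    for instruction, copied in steps:
        content = "".join(f"{name}={files[name]}" for name in sorted(copied))
        parent = hashlib.sha256((parent + instruction + content).encode()).hexdigest()
        keys.append(parent)
    return keys

def cached_layers(steps, before, after):
    """Count the leading layers that are reused when files change from before to after."""
    reused = 0
    for old, new in zip(layer_keys(steps, before), layer_keys(steps, after)):
        if old != new:
            break
        reused += 1
    return reused

# Each step is (instruction, files whose content the step depends on).
GOOD_ORDER = [
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY . .", ["requirements.txt", "app.py"]),
]
BAD_ORDER = [
    ("COPY . .", ["requirements.txt", "app.py"]),
    ("RUN pip install -r requirements.txt", []),
]

before = {"requirements.txt": "flask", "app.py": "v1"}
after = {"requirements.txt": "flask", "app.py": "v2"}  # only application code changed

print(cached_layers(GOOD_ORDER, before, after))  # 2: the pip install layer is reused
print(cached_layers(BAD_ORDER, before, after))   # 0: everything rebuilds
```

With the good order, an edit to app.py leaves the first two layers untouched; with the bad order, the very first layer changes and pip install runs again.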

Compare image sizes:

docker images | grep myapp
# naive (python:3.11): ~1.1GB
# multi-stage (python:3.11-slim): ~200MB
# distroless: ~100MB

Container networking:

docker network create mynet
docker run -d --name db --network mynet -e POSTGRES_PASSWORD=devonly postgres:15   # the image exits at startup if no password is set
docker run -it --network mynet alpine ping db   # resolves by container name
docker inspect mynet   # see connected containers and IP assignments

Wednesday: Docker Compose

Build a docker-compose.yml for a real multi-service stack:

# No top-level "version" key: Compose v2 ignores it and warns that it is obsolete

services:
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    image: myapi:latest
    container_name: myapi
    ports:
      - "5000:5000"
    environment:
      - DATABASE_URL=postgresql://postgres:${POSTGRES_PASSWORD}@db:5432/appdb
      - REDIS_URL=redis://cache:6379/0
      - SECRET_KEY=${SECRET_KEY}
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped
    networks:
      - backend

  db:
    image: postgres:15-alpine
    container_name: mydb
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - backend

  cache:
    image: redis:7-alpine
    container_name: myredis
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    restart: unless-stopped
    networks:
      - backend

volumes:
  postgres_data:
  redis_data:

networks:
  backend:
    driver: bridge

.env (never commit this; commit .env.example instead):

POSTGRES_PASSWORD=changeme_in_production
SECRET_KEY=changeme_in_production

Override for development: docker-compose.override.yml (Compose loads it automatically whenever it sits next to docker-compose.yml, so keep it off production hosts):

services:
  api:
    volumes:
      - ./api:/app   # bind mount for hot reload in dev
    environment:
      - DEBUG=true

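Production must not have the override file, otherwise the bind mount and DEBUG=true apply there too. Either keep the file out of the production checkout or start the stack with docker compose -f docker-compose.yml up -d, which loads only the base file.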


Thursday: Containers as systemd services + resource limits

Running Docker Compose as a systemd service on the VPS:

/etc/systemd/system/myapp.service:

[Unit]
Description=MyApp Docker Compose Stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=120
TimeoutStopSec=30
User=deploy

[Install]
WantedBy=multi-user.target

Resource limits in Compose:

services:
  api:
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
        reservations:
          memory: 128M

Log management β€” prevent containers from filling your disk:

services:
  api:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Friday–Saturday: Build the anchor project (vps-multiservice)

This is the Week 4 portfolio project. Build it properly.

Project: Multi-service API stack on VPS

Repository: vps-multiservice

Structure:

vps-multiservice/
├── api/
│   ├── app.py
│   ├── Dockerfile
│   └── requirements.txt
├── db/
│   └── init.sql
├── nginx/                    # placeholder config; Week 5 replaces this
│   └── default.conf
├── docker-compose.yml
├── docker-compose.override.yml
├── .env.example
├── Makefile
└── README.md

api/app.py (a real Flask API, not hello-world):

from flask import Flask, jsonify
import psycopg2, redis, os, time

app = Flask(__name__)

def get_db():
    return psycopg2.connect(os.environ["DATABASE_URL"])

def get_redis():
    url = os.environ.get("REDIS_URL", "redis://cache:6379/0")
    return redis.from_url(url)

@app.route("/health")
def health():
    checks = {}
    # Check Postgres
    try:
        conn = get_db()
        conn.close()
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
    # Check Redis
    try:
        r = get_redis()
        r.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {e}"

    all_ok = all(v == "ok" for v in checks.values())
    return jsonify({"status": "ok" if all_ok else "degraded", "checks": checks}), \
           200 if all_ok else 503

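START_TIME = time.time()  # captured once at import so /info can report real uptime

@app.route("/info")
def info():
    return jsonify({
        "service": "vps-multiservice-api",
        "version": os.environ.get("APP_VERSION", "dev"),
        "uptime": round(time.time() - START_TIME, 1)  # seconds since this process started
    })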

@app.route("/cache/set/<key>/<value>")
def cache_set(key, value):
    r = get_redis()
    r.setex(key, 300, value)
    return jsonify({"stored": key})

@app.route("/cache/get/<key>")
def cache_get(key):
    r = get_redis()
    value = r.get(key)
    return jsonify({"key": key, "value": value.decode() if value else None})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
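The status logic inside /health can be tested without Postgres or Redis if it is pulled into a pure function. The sketch below assumes a helper named summarize, which is not part of the app above; it mirrors the same rule (200 only when every check is "ok", otherwise 503).

```python
def summarize(checks):
    """Map per-dependency results to the (body, status_code) pair that /health returns."""
    all_ok = all(value == "ok" for value in checks.values())
    body = {"status": "ok" if all_ok else "degraded", "checks": checks}
    return body, 200 if all_ok else 503

# Both dependencies reachable.
print(summarize({"database": "ok", "cache": "ok"})[1])                         # 200
# Redis down: the endpoint reports degraded and returns 503.
print(summarize({"database": "ok", "cache": "error: connection refused"})[1])  # 503
```

Returning 503 on a degraded dependency is what lets curl -f, the Docker HEALTHCHECK, and a load balancer treat the container as unhealthy.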

Makefile:

.PHONY: up down logs ps build restart deploy health shell-api shell-db

up:
	docker compose up -d --build

down:
	docker compose down

logs:
	docker compose logs -f

ps:
	docker compose ps

build:
	docker compose build --no-cache

restart:
	docker compose restart

health:
	curl -s http://localhost:5000/health | python3 -m json.tool

shell-api:
	docker compose exec api /bin/bash

shell-db:
	docker compose exec db psql -U postgres -d appdb

README.md must answer: what this project is, what services it runs, how to start it locally, how to deploy to production, what the health endpoint checks, what environment variables are required, what the architecture looks like (diagram), and what decisions were made (why named volumes over bind mounts, why service_healthy condition on depends_on, why non-root user in Dockerfile).

Deploy to the Hetzner VPS: clone the repo, create .env from .env.example, make up, verify make health returns 200. Configure the systemd service so it starts on boot.


Sunday: Application strategy

This is the only Sunday in Phase 1 not dedicated to pure technical work. Block the full day.

Resume (one page, PDF):

Header: your name, email, GitHub URL, LinkedIn URL, location (Delhi NCR), phone.

Title: Junior DevOps & Infrastructure Engineer

Experience section:

Software Developer Intern, [Antler-backed venture studio]  [dates]
• Shipped production SaaS MVPs across multiple projects; owned Docker-based
  deployment pipelines and CI/CD configuration for 3+ products
• Implemented GitHub Actions workflows for automated testing and deployment
• Worked across time zones with a distributed team

Projects section (this is more important than the internship for a DevOps role):

FactorSphere (factorsphere.org): Production Edge Platform
• Live academic journal ranking platform with 4,000+ journals, real users,
  Google-indexed; won college project exhibition
• Serverless microservices on Cloudflare Workers (analogous to AWS Lambda +
  API Gateway), Pinecone vector database, LLM inference pipeline
• CI/CD: GitHub Actions → Wrangler → Cloudflare edge deployment
• Architecture: fully edge-native; documented in ARCHITECTURE.md on GitHub

VPS Multi-Service Stack (github.com/you/vps-multiservice)
• Multi-service API stack on Ubuntu VPS: Flask + Redis + PostgreSQL
• Multi-stage Docker builds, Docker Compose, named volumes, health checks
• systemd service management, UFW firewall configuration, SSH hardening

Skills section:

Infrastructure: Linux (Ubuntu/Arch), Docker, Docker Compose, systemd, UFW, SSH
Scripting: Python (subprocess, requests, argparse, YAML/JSON), Bash
CI/CD: GitHub Actions, Wrangler (Cloudflare)
Networking: DNS, HTTP/S, TCP/IP, SSL/TLS, reverse proxy concepts
Observability: log analysis (journalctl), endpoint monitoring
Platforms: Cloudflare Workers/Pages, Pinecone, Hetzner VPS
Version Control: Git (branching, rebasing, structured commits)

Do not list technologies you can't defend in a 5-minute conversation. If Kubernetes is not on your CV, a recruiter won't ask about it. If it is, they will.

LinkedIn:

  • Headline: Junior DevOps Engineer | Cloudflare Workers | Docker | Python | Delhi NCR
  • About section: 3 sentences. "CS grad with production experience at an Antler-backed venture studio. Built and deployed FactorSphere (factorsphere.org), a live platform running on a serverless edge architecture. Targeting junior DevOps and cloud infrastructure roles in Delhi NCR."
  • Add all projects with links
  • Enable "Open to Work" (visible to recruiters, not your network if you prefer)

Job titles to search:

  • "DevOps Engineer" + fresher/junior/0-2 years
  • "Cloud Support Engineer"
  • "Infrastructure Engineer"
  • "Site Reliability Engineer" (rare at fresher level but exists)
  • "Systems Engineer" (often infrastructure work at service companies)

Platforms, in priority order:

  1. LinkedIn: set job alerts for each title + Delhi, Gurgaon, Noida
  2. Naukri.com: mandatory for NCR; service companies and GCCs post exclusively here
  3. Instahyre: funded product companies
  4. Wellfound (AngelList): funded startups
  5. Company career pages directly: Nagarro, GlobalLogic, Publicis Sapient, Genpact, NIIT Technologies, Mphasis

Outreach message template (LinkedIn, max 5 lines):

Hi [Name], I'm a CS grad with production experience shipping SaaS MVPs at an Antler-backed venture studio, including a live platform (factorsphere.org) running on a serverless edge architecture with Docker, CI/CD, and Python scripting across projects. I'm targeting junior DevOps roles in NCR and noticed [Company] works on [relevant tech or cloud platform from their JD]. Would it be appropriate to share my CV directly?

Send this to: DevOps leads, engineering managers, or HR at target companies. Not recruiters at staffing agencies (waste of time for this profile). Target 10 outreach messages in the first week of applying.

By Sunday evening, completed:

  • Resume PDF finalized
  • LinkedIn updated
  • Naukri profile created with correct resume
  • 5 job applications submitted
  • 5 outreach messages sent on LinkedIn

Deliverables, Week 4:

  • vps-multiservice repo: running on VPS, full README, Makefile, .env.example, meaningful commit history showing incremental development
  • ops-scripts repo: polished with all Week 1–3 scripts, clean README
  • endpoint-monitor.py and vps-health-reporter.py running and documented
  • FactorSphere: docs/ARCHITECTURE.md and docs/CICD.md committed
  • Resume PDF (one page)
  • LinkedIn updated
  • 5+ applications submitted, 5+ outreach messages sent
  • Job alert set on LinkedIn and Naukri

PHASE 2: Weeks 5–12 - Become Hireable by Stronger Companies


Week 5 - Nginx

What this week unlocks: Reverse proxy knowledge is expected at junior level. SSL termination and upstream configuration come up in almost every DevOps technical round. You also get HTTPS on your VPS project β€” a visible signal of operational maturity.

Study hours: 42

Learning objectives: Virtual hosts, reverse proxy, SSL termination with Let's Encrypt, load balancing upstream blocks, rate limiting, security headers, static file serving


Monday: Installation and configuration structure

apt install nginx
systemctl enable nginx
systemctl start nginx
nginx -v

Configuration structure on Ubuntu:

  • /etc/nginx/nginx.conf: main config; defines worker_processes, events, and the http block
  • /etc/nginx/sites-available/: your server blocks live here
  • /etc/nginx/sites-enabled/: symlinks to sites-available for active configs
  • nginx -t: test config syntax; always run before systemctl reload nginx
  • systemctl reload nginx: graceful reload (no dropped connections) vs restart (all connections dropped)

Read /etc/nginx/nginx.conf. Understand: worker_processes auto uses all cores, worker_connections 1024 limits connections per worker, the include /etc/nginx/sites-enabled/* line.
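These two directives together set a rough ceiling on concurrent connections. The sketch below is back-of-the-envelope arithmetic, not a benchmark. It assumes that worker_connections counts every connection a worker holds, including connections to proxied backends, so a reverse proxy spends about two connections per client. The function name max_clients is made up for this example.

```python
def max_clients(worker_processes, worker_connections, reverse_proxy=False):
    """Rough upper bound on simultaneous clients for an nginx instance.

    worker_connections limits all connections per worker, including upstream
    ones, so a proxied request occupies roughly two slots (client + backend).
    """
    total = worker_processes * worker_connections
    return total // 2 if reverse_proxy else total

# A 2-core VPS with "worker_processes auto" and "worker_connections 1024":
print(max_clients(2, 1024))                      # 2048 when serving static files
print(max_clients(2, 1024, reverse_proxy=True))  # 1024 when proxying to Flask
```

In practice the open-file limit and the backend's own capacity usually bite before this number does, but it is a useful figure to quote when asked what worker_connections means.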


Tuesday–Wednesday: Reverse proxy + SSL

Create /etc/nginx/sites-available/myapp:

server {
    listen 80;
    server_name vps.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 30s;
        proxy_read_timeout 30s;
        proxy_buffering on;
    }

    location /health {
        proxy_pass http://127.0.0.1:5000/health;
        access_log off;   # don't pollute logs with health checks
    }
}

Enable the site, test the config, and reload:

ln -s /etc/nginx/sites-available/myapp /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx

SSL with Certbot:

apt install certbot python3-certbot-nginx
certbot --nginx -d vps.yourdomain.com
# Follow prompts: enter email, agree to TOS, choose redirect HTTP→HTTPS
certbot renew --dry-run   # verify auto-renewal works

After Certbot runs, inspect /etc/nginx/sites-available/myapp, because Certbot modifies it. The config now has a listen 443 ssl block and an HTTP-to-HTTPS redirect. Read and understand what was added.

Auto-renewal is handled by a systemd timer installed by Certbot: systemctl status certbot.timer.


Thursday: Multiple virtual hosts + static files

Three server blocks:

  1. docs.yourdomain.com → serves static HTML files from /var/www/docs/
  2. api.yourdomain.com → reverse proxy to Flask API on port 5000
  3. Default catch-all → returns 444 (nginx closes connection without response)

Static file serving:

server {
    listen 443 ssl;
    server_name docs.yourdomain.com;
    # (SSL certs added by certbot)

    root /var/www/docs;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    # Cache static assets
    location ~* \.(css|js|png|jpg|ico)$ {
        expires 30d;
        add_header Cache-Control "public, immutable";
    }
}

Default catch-all:

server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;
    ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
    ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
    return 444;
}

Friday–Saturday: Load balancing + rate limiting + security headers

Upstream block for load balancing:

upstream api_backends {
    least_conn;   # route to server with fewest active connections
    server 127.0.0.1:5000 weight=2;
    server 127.0.0.1:5001 weight=1;   # start a second Flask instance for this exercise
    keepalive 32;
}

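server {
    location / {
        proxy_pass http://api_backends;
        proxy_http_version 1.1;           # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";   # and a cleared Connection header
    }
}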
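To build intuition for least_conn with weights, here is a simplified Python model. It assumes the balancer picks the backend with the lowest ratio of active connections to weight, and it breaks ties by taking the first server, whereas nginx falls back to weighted round-robin among tied servers. pick_backend and simulate are names invented for this sketch.

```python
def pick_backend(backends):
    """Choose the backend with the lowest active-connections-to-weight ratio."""
    return min(backends, key=lambda b: b["active"] / b["weight"])

def simulate(requests):
    """Send long-lived requests one by one and return connections per port."""
    backends = [
        {"port": 5000, "weight": 2, "active": 0},
        {"port": 5001, "weight": 1, "active": 0},
    ]
    for _ in range(requests):
        pick_backend(backends)["active"] += 1  # connection opened and held open
    return {b["port"]: b["active"] for b in backends}

print(simulate(6))  # {5000: 4, 5001: 2}
```

With weights 2 and 1, the first server ends up holding twice as many connections, which is the behaviour to look for in the access logs when you run two Flask instances.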

Rate limiting:

# In http block (nginx.conf or a conf.d include):
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=health_limit:1m rate=60r/m;

# In server block:
location /api/ {
    limit_req zone=api_limit burst=20 nodelay;
    proxy_pass http://api_backends;
}
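The interaction of rate and burst is easier to reason about with a small model. The sketch below is a simplified take on the leaky-bucket accounting behind limit_req for a single client key with nodelay. It tracks an "excess" counter that drains at the configured rate and rejects a request once the excess would go above burst. The class name LimitReqModel is made up, and real nginx works in milliseconds with shared memory, so treat this as an approximation.

```python
class LimitReqModel:
    """Simplified single-key model of limit_req with nodelay."""

    def __init__(self, rate, burst):
        self.rate = rate      # allowed requests per second
        self.burst = burst    # extra requests tolerated above the rate
        self.excess = 0.0     # requests currently "over" the steady rate
        self.last = None      # time of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:              # first request from this client
            self.last = now
            return True
        drained = self.rate * (now - self.last)
        excess = max(0.0, self.excess - drained + 1)
        if excess > self.burst:
            return False                   # nginx answers 503 by default
        self.excess, self.last = excess, now
        return True

limiter = LimitReqModel(rate=10, burst=20)
accepted = sum(limiter.allow(0.0) for _ in range(25))
print(accepted)            # 21: one request at the rate plus a burst of 20
print(limiter.allow(1.0))  # True: one second later, 10 slots have drained
```

Without nodelay the same requests would still be accepted, but nginx would queue the excess ones and release them at 10 r/s instead of forwarding them immediately.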

Security headers:

add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "0" always;   # legacy header: modern browsers dropped the XSS auditor, so disabling it is the current recommendation
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;

Sunday: Update vps-multiservice

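Replace the placeholder nginx/default.conf in the repo with the host Nginx site config from this week (the Compose stack has no Nginx container; the host Nginx proxies to the API's published port). Update README.md: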

  • Architecture diagram showing: internet → Nginx (host, SSL) → Docker network → Flask API container
  • SSL configuration documented
  • How to reproduce (Certbot commands, nginx site config)

Update the vps-multiservice README with a "Production Deployment" section showing the full stack.


Week 6 - GitHub Actions for VPS

What this week unlocks: CI/CD for VPS-hosted projects is the most visible resume signal. This pipeline demonstrates that your Docker Compose stack is managed like production infrastructure, not run manually. Combined with FactorSphere's existing CI/CD, you can now speak to two different CI/CD patterns.

Study hours: 42


Monday–Tuesday: GitHub Actions syntax

Create .github/workflows/deploy.yml in vps-multiservice:

name: Deploy to VPS

on:
  push:
    branches: [main]
  workflow_dispatch:   # allow manual trigger from GitHub UI

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate docker-compose
        run: docker compose config

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to VPS
        uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.VPS_HOST }}
          username: deploy
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            cd /opt/myapp
            git pull origin main
            docker compose pull
            docker compose up -d --build --remove-orphans
            docker system prune -f

      - name: Smoke test
        run: |
          sleep 10
          curl -f https://vps.yourdomain.com/health || exit 1

GitHub Secrets to configure (Settings → Secrets and Variables → Actions):

  • VPS_HOST: your VPS IP
  • VPS_SSH_KEY: contents of a dedicated deploy private key (generate a new ed25519 key specifically for GitHub Actions, add the public key to ~/.ssh/authorized_keys on the VPS)

Never put the private key in the repo or hardcode it in the workflow.


Wednesday–Thursday: Zero-downtime consideration + notifications

The naive docker compose up -d --build restarts containers, causing brief downtime. For a portfolio project this is acceptable. Document this limitation in the README and explain what a production mitigation looks like (blue-green deployment, rolling update in Kubernetes, or a health-check grace period with a load balancer).
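Because the restart leaves a short gap, a fixed sleep before the smoke test can be flaky in either direction. One alternative is to poll the health endpoint until it answers 200. The sketch below is a minimal version using only the standard library; fetch_status and wait_until_healthy are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request

def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code          # e.g. 503 while dependencies are still starting
    except (urllib.error.URLError, OSError):
        return None              # connection refused, DNS failure, timeout

def wait_until_healthy(url, attempts=10, delay=3.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200.

    Returns the attempt number that succeeded, or None if every attempt failed.
    fetch and sleep are injectable so the logic can be tested without a network.
    """
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

In the workflow this could replace the sleep 10 and curl pair, for example by calling wait_until_healthy("https://vps.yourdomain.com/health") from a small script and exiting non-zero when it returns None.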

Add failure notification via GitHub's built-in email (no setup required: by default GitHub emails you when a workflow run you triggered fails).

Add a status badge to the README:

![Deploy](https://github.com/yourusername/vps-multiservice/actions/workflows/deploy.yml/badge.svg)

Friday–Saturday: Makefile deployment tooling

Add to the Makefile:

deploy:
	@echo "Deploying to VPS..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git pull && docker compose up -d --build"

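# Rollback rebuilds from the previous commit and leaves the server on a detached HEAD;
# run "git checkout main" on the server before the next deploy.
rollback:
	@echo "Rolling back to the previous commit..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git checkout HEAD~1 && docker compose up -d --build"

status:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose ps && curl -s http://localhost:5000/health"

logs-prod:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose logs -f --tail=100"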

Usage: make VPS_HOST=<ip> deploy, or set VPS_HOST in a local .env.make (not committed) and load it with -include .env.make at the top of the Makefile.


Sunday: FactorSphere CI/CD comparison

Update docs/CICD.md in FactorSphere to add a comparison section:

## CI/CD Pattern Comparison

FactorSphere (Cloudflare Workers):
  Trigger: push to main
  Pipeline: GitHub Actions → wrangler deploy → Cloudflare edge network
  Rollback: wrangler rollback <deployment-id> or git revert + push
  State: stateless Workers, no server to manage

vps-multiservice (Docker Compose + SSH):
  Trigger: push to main
  Pipeline: GitHub Actions → SSH → git pull → docker compose up
  Rollback: git checkout HEAD~1 + docker compose up (or tag-based rollback)
  State: stateful services (Postgres data in named volume), must manage carefully

Being able to explain this comparison (why each pattern exists and what tradeoffs each makes) is a strong interview signal.


Weeks 7–9 - AWS (EC2, IAM, S3, VPC, CloudWatch)

Three weeks of AWS. You are building toward a single coherent project: the same multi-service stack, now running on EC2, with IAM roles, S3 backups, VPC networking, and CloudWatch logging.


Week 7 - EC2 + IAM

What this week unlocks: AWS is on your CV with real hands-on evidence. Most NCR product companies and GCCs require at least basic AWS. After this week you can pass a cloud support phone screen.

Learning objectives: EC2 launch and management, SSH key pairs, security groups, IAM users/roles/policies, principle of least privilege, AWS CLI, instance profiles


Monday–Tuesday: AWS account + IAM baseline

Create an AWS account (personal email, not institutional). Immediately:

  1. Enable MFA on the root account
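  2. Create a billing alarm: first enable billing alerts in the Billing preferences, then switch the console region to us-east-1 (billing metrics are published only there) and go to CloudWatch → Alarms → Create Alarm → Billing → Total Estimated Charge → threshold $5 → email notification. This protects against accidentally leaving expensive resources running.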
  3. Create an IAM user for yourself: yourname-admin, AdministratorAccess policy, programmatic + console access, enable MFA
  4. Never use root again

Configure AWS CLI:

aws configure
# AWS Access Key ID: [from IAM user]
# AWS Secret Access Key: [from IAM user]
# Default region: ap-south-1   (Mumbai, lowest latency from Delhi)
# Default output format: json

aws sts get-caller-identity   # verify who you're authenticated as
aws iam get-user              # verify correct user

IAM: know these three things cold:

  • User: a person or application with long-term credentials (access keys)
  • Role: an identity assumed temporarily; no long-term credentials; assumed by EC2 instances, Lambda, other services
  • Policy: a JSON document that defines what actions are allowed on which resources

Create a principle of least privilege policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-specific-bucket",
        "arn:aws:s3:::my-specific-bucket/*"
      ]
    }
  ]
}

This is more secure than "Action": "s3:*" and more secure than "Resource": "*". Be ready to explain why.
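One way to practise the explanation is to evaluate a few requests against both policies. The sketch below is a toy evaluator, not the real IAM engine. It only handles Allow statements with wildcard matching and ignores explicit Deny, conditions, and which resource type each action applies to. The names is_allowed, SCOPED, and BROAD are made up for the example.

```python
from fnmatch import fnmatchcase

SCOPED = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-specific-bucket",
            "arn:aws:s3:::my-specific-bucket/*",
        ],
    }],
}

BROAD = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}

def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)

def is_allowed(policy, action, resource):
    """Return True if any Allow statement matches both the action and the resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False  # IAM denies by default when nothing allows the request

obj = "arn:aws:s3:::my-specific-bucket/report.csv"
other = "arn:aws:s3:::payroll-bucket/salaries.csv"

print(is_allowed(SCOPED, "s3:GetObject", obj))       # True
print(is_allowed(SCOPED, "s3:DeleteObject", obj))    # False: action not listed
print(is_allowed(SCOPED, "s3:GetObject", other))     # False: different bucket
print(is_allowed(BROAD, "s3:DeleteObject", other))   # True: wildcards allow everything in S3
```

The last line is the interview answer in miniature: with s3:* on *, a leaked credential can delete objects in every bucket in the account.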


Wednesday–Thursday: EC2 launch and management

Launch a t2.micro (free tier) with Ubuntu 22.04 in ap-south-1:

  • AMI: Ubuntu 22.04 LTS (search "ubuntu 22.04 hvm" in Community AMIs or use the Quick Start)
  • Instance type: t2.micro (free tier eligible)
  • Key pair: create a new key pair named ec2-deploy-key, download the .pem
  • Security group: create web-sg
    • Inbound: SSH (22) from your IP only (not 0.0.0.0/0), HTTP (80) from 0.0.0.0/0, HTTPS (443) from 0.0.0.0/0
    • Outbound: all traffic (default)
  • Storage: 8GB gp3 (free tier)

SSH in:

chmod 400 ec2-deploy-key.pem
ssh -i ec2-deploy-key.pem ubuntu@<ec2-public-ip>

Install Docker on the EC2:

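sudo apt update
sudo apt install -y docker.io docker-compose-v2   # Ubuntu's package for "docker compose"; docker-compose-plugin ships only in Docker's own apt repo
sudo usermod -aG docker ubuntu                     # log out and back in for the group change to apply
sudo systemctl enable docker
sudo systemctl start docker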

User data bootstrap: relaunch with a user data script that installs Docker automatically. In the EC2 launch wizard, under "Advanced Details" → "User data":

#!/bin/bash
apt-get update -y
apt-get install -y docker.io docker-compose-v2 git
usermod -aG docker ubuntu
systemctl enable docker
systemctl start docker

This is how you automate instance bootstrap, an important EC2 concept.


Friday–Saturday: IAM role + instance profile

Create an IAM role for EC2 to access S3:

  1. IAM → Roles → Create role
  2. Trusted entity: AWS service → EC2
  3. Policy: create inline policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::your-backup-bucket",
      "arn:aws:s3:::your-backup-bucket/*"
    ]
  }]
}

  4. Name the role ec2-app-role
  5. Attach the role to your EC2 instance: EC2 → Actions → Security → Modify IAM role

Now on the EC2 instance, with the AWS CLI installed but no credentials configured:

aws s3 ls s3://your-backup-bucket   # works because instance has IAM role
aws sts get-caller-identity          # shows the assumed-role identity

This is the correct way to give EC2 access to AWS services, rather than putting access keys on the instance. Be ready to explain why: long-term access keys stored on a compromised instance can be exfiltrated and stay valid until someone revokes them, whereas a role's credentials are temporary, rotated automatically, and scoped to the role's policy.

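Deploy vps-multiservice to EC2: clone the repo, create .env, docker compose up -d. Verify on the instance itself that curl http://localhost:5000/health returns 200. The web-sg security group does not open port 5000, so curl http://<ec2-ip>:5000/health from your laptop only works if you temporarily allow port 5000 from your IP.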


Week 8 - S3 + VPC

Learning objectives: S3 bucket operations, IAM policies for S3, static hosting, lifecycle policies, presigned URLs, boto3; VPC fundamentals (subnets, route tables, internet gateway, security groups vs NACLs)


Monday–Tuesday: S3

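# Create bucket (names are globally unique: pick your own and use it wherever "your-backup-bucket" appears)
aws s3 mb s3://your-backup-bucket --region ap-south-1

# Upload and download
aws s3 cp backup.tar.gz s3://your-backup-bucket/backups/backup.tar.gz
aws s3 sync ./logs/ s3://your-backup-bucket/logs/
aws s3 ls s3://your-backup-bucket/ --recursive

# Static website hosting
aws s3 mb s3://your-docs-site
aws s3 website s3://your-docs-site --index-document index.html --error-document 404.html
# New buckets block public access and have ACLs disabled, so --acl public-read is rejected.
# Allow public bucket policies, then grant read access with a bucket policy instead.
aws s3api put-public-access-block --bucket your-docs-site \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=false,RestrictPublicBuckets=false
# public-read-policy.json is a file you write: Allow s3:GetObject on arn:aws:s3:::your-docs-site/* for Principal "*"
aws s3api put-bucket-policy --bucket your-docs-site --policy file://public-read-policy.json
aws s3 sync ./docs/ s3://your-docs-site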

Lifecycle policy (JSON):

{
  "Rules": [{
    "ID": "archive-old-backups",
    "Status": "Enabled",
    "Filter": {"Prefix": "backups/"},
    "Transitions": [{
      "Days": 30,
      "StorageClass": "GLACIER"
    }],
    "Expiration": {"Days": 365}
  }]
}

Save it as lifecycle.json and apply it:

aws s3api put-bucket-lifecycle-configuration \
  --bucket your-backup-bucket \
  --lifecycle-configuration file://lifecycle.json
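To check that you read the rule correctly, here is a small Python sketch of what it implies for an object under backups/ at a given age. It is a simplification: S3 applies lifecycle actions asynchronously, so the real transition can lag the threshold by a day or so. The function name storage_state is made up for this example.

```python
def storage_state(age_days, transition_days=30, expiration_days=365):
    """Return where an object under backups/ should be after age_days days."""
    if age_days >= expiration_days:
        return "EXPIRED"    # deleted by the Expiration action
    if age_days >= transition_days:
        return "GLACIER"    # moved by the Transitions action
    return "STANDARD"       # still in the storage class it was uploaded with

for age in (10, 30, 364, 365):
    print(age, storage_state(age))
```

A backup therefore spends about a month in standard storage, roughly eleven months in Glacier, and is removed after a year.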

Presigned URLs with Python (boto3):

import boto3

s3 = boto3.client("s3", region_name="ap-south-1")

# Generate URL that expires in 1 hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "your-backup-bucket", "Key": "backups/backup.tar.gz"},
    ExpiresIn=3600
)
print(url)  # anyone with this URL can download for 1 hour

Update backup.sh to upload to S3 after creating the local archive:

# At end of backup.sh:
if command -v aws &>/dev/null && [ -n "${S3_BUCKET:-}" ]; then   # skip the upload when the CLI or the bucket name is missing
    aws s3 cp "$ARCHIVE" "s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")" && \
        echo "Uploaded to S3: s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")"
fi

Wednesday–Thursday: VPC

The default VPC works for most things. Understanding custom VPCs is the interview signal.

Create a custom VPC:

  1. VPC → Create VPC
    • Name: app-vpc
    • CIDR: 10.0.0.0/16
  2. Create subnets:
    • Public subnet: 10.0.1.0/24, AZ: ap-south-1a
    • Private subnet: 10.0.2.0/24, AZ: ap-south-1b
  3. Create internet gateway, attach to app-vpc
  4. Route table for public subnet: add route 0.0.0.0/0 → internet gateway
  5. Route table for private subnet: no internet route (private)
  6. Launch EC2 in the public subnet with auto-assign public IP enabled (it is usually off by default for subnets you create yourself); the instance then gets a public IP and internet access
  7. Understand: an EC2 in a private subnet has no internet access unless you add a NAT gateway (costs ~$0.045/hour; don't provision one, just understand it)

The interview question version: "What's the difference between a public and private subnet in AWS?" Answer: a public subnet has a route to an internet gateway; a private subnet does not. Resources in a private subnet can reach the internet via a NAT gateway but cannot be reached from the internet directly.
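The CIDR layout above can be sanity-checked with Python's standard ipaddress module. This sketch confirms that both subnets sit inside the VPC range and do not overlap, and it estimates usable addresses on the assumption that AWS reserves five addresses in every subnet. The helper name usable_hosts is made up.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
public = ipaddress.ip_network("10.0.1.0/24")
private = ipaddress.ip_network("10.0.2.0/24")

AWS_RESERVED_PER_SUBNET = 5  # network, router, DNS, future use, broadcast

def usable_hosts(subnet):
    """Addresses left for instances after AWS's reserved addresses."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET

print(public.subnet_of(vpc), private.subnet_of(vpc))  # True True
print(public.overlaps(private))                       # False
print(usable_hosts(public))                           # 251
print(vpc.num_addresses // public.num_addresses)      # 256 possible /24 subnets
```

Being able to say "a /24 gives 256 addresses, 251 usable on AWS" is a quick way to show you understand the layout rather than having clicked through the wizard.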


Friday–Saturday: Integrate S3 into the CI/CD pipeline

Update .github/workflows/deploy.yml in vps-multiservice:

- name: Archive deployed compose file to S3
  run: |
    aws s3 cp docker-compose.yml \
      s3://your-backup-bucket/deployments/$(date +%Y%m%d_%H%M%S)/docker-compose.yml
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_DEFAULT_REGION: ap-south-1

This demonstrates integration of S3 into a CI/CD pipeline. The backup is a real operational artifact: it means you can reconstruct what was deployed at any given time.


Week 9 - CloudWatch

Learning objectives: CloudWatch Logs (log groups, log streams, log agents), metric filters, alarms, custom metrics via boto3


Monday–Tuesday: CloudWatch Logs

Install CloudWatch agent on EC2:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

Configure /opt/aws/amazon-cloudwatch-agent/bin/config.json:

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "vps-multiservice",
            "log_stream_name": "nginx-access"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "vps-multiservice",
            "log_stream_name": "nginx-error"
          }
        ]
      }
    }
  }
}

Load the config and start the agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

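The EC2 IAM role needs CloudWatch permissions. In keeping with least privilege, add a statement that lists the specific actions instead of logs:* (or attach the AWS managed CloudWatchAgentServerPolicy):

{
  "Effect": "Allow",
  "Action": [
    "cloudwatch:PutMetricData",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:DescribeLogStreams"
  ],
  "Resource": "*"
}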

Wednesday–Thursday: Metric filters + alarms

In the AWS console:

  1. CloudWatch → Log groups → vps-multiservice → nginx-access
  2. Create metric filter: pattern [ip, id, user, timestamp, request, status_code=5*, ...]
  3. Metric name: nginx-5xx, metric value: 1
  4. Create alarm on this metric: if sum > 5 in 5 minutes → SNS notification to your email
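To see what the filter pattern is matching, the sketch below does the same thing locally in Python: it splits an nginx access-log line into the fields named in the pattern and counts lines whose status starts with 5. The regular expression assumes nginx's default combined log format, and the sample lines are invented for the example.

```python
import re

# Fields in the same order as the metric filter: ip, id, user, timestamp, request, status
LINE = re.compile(
    r'^(?P<ip>\S+) (?P<id>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)

def count_5xx(lines):
    """Count log lines whose status code is 5xx, like status_code=5* in the filter."""
    count = 0
    for line in lines:
        match = LINE.match(line)
        if match and match.group("status").startswith("5"):
            count += 1
    return count

SAMPLE = [
    '203.0.113.7 - - [10/Oct/2025:13:55:36 +0530] "GET /health HTTP/1.1" 200 52 "-" "curl/8.5.0"',
    '203.0.113.7 - - [10/Oct/2025:13:55:40 +0530] "GET /info HTTP/1.1" 502 157 "-" "curl/8.5.0"',
    '198.51.100.4 - alice [10/Oct/2025:13:56:01 +0530] "POST /api/items HTTP/1.1" 503 0 "-" "python-requests/2.31.0"',
]

print(count_5xx(SAMPLE))  # 2
```

The alarm then fires when the equivalent count in CloudWatch exceeds 5 within a 5-minute window.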

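Billing alarm (if not done in Week 7). Billing metrics are published only in us-east-1, so the alarm and its SNS topic must be created there:

aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name "billing-alert-5usd" \
    --alarm-description 'Alert when charges exceed $5' \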
    --metric-name EstimatedCharges \
    --namespace AWS/Billing \
    --statistic Maximum \
    --period 86400 \
    --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=Currency,Value=USD \
    --evaluation-periods 1 \
    --alarm-actions <your-sns-topic-arn>

Friday–Saturday: Custom metrics + Python integration

Add a --push-to-cloudwatch flag to vps-health-reporter.py:

import boto3

def push_to_cloudwatch(metrics_dict, namespace="VPS/Health"):
    cw = boto3.client("cloudwatch", region_name="ap-south-1")
    metric_data = []
    for name, value in metrics_dict.items():
        metric_data.append({
            "MetricName": name,
            "Value": float(value),
            "Unit": "Percent" if "pct" in name.lower() else "Count"
        })
    if metric_data:
        cw.put_metric_data(Namespace=namespace, MetricData=metric_data)
        print(f"Pushed {len(metric_data)} metrics to CloudWatch")

Run this as a cron job on the EC2 every 5 minutes:

crontab -e
# Add:
*/5 * * * * /usr/bin/python3 /opt/scripts/vps-health-reporter.py --push-to-cloudwatch

In the CloudWatch console, verify the custom metrics appear under the VPS/Health namespace.


Week 10: Phase 2 Consolidation

What this week unlocks: You have AWS concretely on your CV. This week polishes everything, prepares interview answers for each project, and pushes the application cadence to 10+/week.

Monday–Wednesday: Documentation and GitHub polish

vps-multiservice README must now include:

  • Architecture diagram: internet → Route 53 (or direct IP) → Nginx (SSL) → Docker network → Flask/Redis/Postgres
  • CI/CD flow: push to GitHub → Actions → SSH deploy to EC2 → smoke test
  • AWS resources used: EC2, IAM role, S3 (backups), CloudWatch (logs and metrics)
  • How to reproduce: step-by-step from zero to running

ops-scripts README: add a section on S3 backup integration and CloudWatch metric publishing.

Thursday–Friday: Interview preparation (talking about your projects)

For each project, prepare a 3-minute verbal answer to "tell me about this project":

  • FactorSphere: "FactorSphere is a live academic journal ranking platform I built that serves real users. The architecture is fully edge-native: it runs on Cloudflare Workers rather than a traditional server, which means requests are processed at the edge closest to the user. The backend is a set of serverless microservices analogous to AWS Lambda: one for search, one for ranking, one for data processing. I integrated Pinecone as a managed vector database for semantic search and an LLM for query understanding. CI/CD is automated via GitHub Actions and Wrangler. I've documented the full pipeline in the repo."
  • vps-multiservice: "This is a multi-service API stack I deployed on a VPS and on EC2. It runs Flask, Redis, and Postgres in Docker containers orchestrated by Docker Compose. I wrote multi-stage Dockerfiles to minimize image size, configured health checks so Compose knows when each service is ready before starting dependents, and set up Nginx as a reverse proxy with SSL termination. The deployment is automated via GitHub Actions: a push to main triggers an SSH deploy to the server with a smoke test. Logs ship to CloudWatch and backups go to S3 via an IAM role."

Saturday–Sunday: Application push

By Week 10, increase to 10–15 applications/week. You now have AWS on your CV. Apply to roles that were out of reach at Week 4:

  • "AWS Cloud Engineer (Junior/Fresher)"
  • "Cloud Infrastructure Engineer"
  • GCC roles that specify AWS in the JD

Weeks 11–12: Buffer, Deepening, and Interview Prep

Week 11: Take the weakest area (likely whichever of Linux/Docker/AWS you feel least confident explaining verbally) and go deeper. Run through interview scenarios: set up a broken Docker network and diagnose it, break an Nginx config and read the error, terminate an EC2 instance and relaunch from scratch without notes.

Week 12: Final project polish, README quality pass on all repos, continue applying. If interviews are happening, focus prep here on the specific company's stack.


PHASE 3 (Weeks 13–24): Become Genuinely Competent


Weeks 13–15: Terraform

What this unlocks: Infrastructure as Code is expected at mid-level and increasingly mentioned in junior JDs. More importantly, after doing Weeks 7–9 by hand in the console and CLI, Terraform will click immediately: you already understand the resources, and now you're just defining them declaratively.

Week 13: Terraform core concepts (providers, resources, state, plan/apply/destroy). Write Terraform that creates the Week 7/8 infrastructure: EC2, security groups, IAM role, S3 bucket. Run terraform plan to see what it would create. Run terraform apply. Verify it matches what you built manually.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "your-terraform-state"
    key    = "vps-multiservice/terraform.tfstate"
    region = "ap-south-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0287a05f0ef0e9d9a"  # Ubuntu 22.04, ap-south-1
  instance_type = "t2.micro"
  key_name      = aws_key_pair.deploy.key_name
  subnet_id     = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name
  user_data = file("bootstrap.sh")

  tags = {
    Name    = "vps-multiservice"
    Project = "portfolio"
  }
}
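
One way to script the verification step: after terraform apply, a second terraform plan that reports no changes means the configuration and the recorded state agree. The sketch below is a hedged illustration; the function names and the workdir default are assumptions, and check_drift needs the terraform binary on PATH. It relies on the documented -detailed-exitcode behaviour of terraform plan (0 = no changes, 1 = error, 2 = changes pending).

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_RESULTS = {0: "no-changes", 1: "error", 2: "changes-pending"}


def interpret_plan_exit(code):
    """Translate a plan exit code into a readable status."""
    return PLAN_RESULTS.get(code, "unknown")


def check_drift(workdir="."):
    """Run a plan in workdir and report whether anything would change."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit(proc.returncode)
```

A "changes-pending" result right after an apply usually points at something edited by hand in the console, which is exactly the habit Terraform is meant to replace.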

Week 14: Terraform state, modules, variables, outputs. Write a module for the EC2 + security group pair so it's reusable. Store state remotely in the S3 backend with DynamoDB state locking (DynamoDB is free at this scale; recent Terraform releases can also lock natively in S3 with use_lockfile).

Week 15: Terraform for the full Week 7–9 stack: EC2, VPC, subnets, route tables, internet gateway, IAM roles, S3 buckets, CloudWatch log group. The entire infrastructure now lives in the infrastructure-iac repo. terraform apply from scratch should produce a working environment.

Repository: infrastructure-iac. It contains Terraform for all AWS infrastructure, a README explaining what it creates and why, a committed .terraform.lock.hcl, a configured state backend, and variables documented in variables.tf.


Weeks 16–17: Prometheus + Grafana

What this unlocks: Monitoring stack demonstrates operational maturity beyond deployment. "Have you set up monitoring?" is a common interview question; being able to say yes with a GitHub repo is a strong differentiator at junior level.

Week 16: Deploy Prometheus and the Node Exporter on the Hetzner VPS (not EC2, to keep costs at zero here). Prometheus scrapes Node Exporter for system metrics (CPU, memory, disk, network). Scrape the Flask API's /metrics endpoint (add prometheus-flask-exporter to the app). Prometheus config:

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

  - job_name: flask-api
    static_configs:
      - targets: ['localhost:5000']

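Run everything in Docker Compose: add Prometheus and Node Exporter to the existing docker-compose.yml in vps-multiservice. Inside the Prometheus container, localhost is the container itself, so the localhost targets above only work when Prometheus runs directly on the host. Under Compose, point the flask-api job at the API's Compose service name and reach the host-networked Node Exporter at host.docker.internal:9100 (if UFW is active, it may need a rule allowing the Docker bridge subnet to reach port 9100):

prometheus:
  image: prom/prometheus:latest
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus_data:/prometheus   # declare prometheus_data under the top-level volumes: key
  ports:
    - "9090:9090"
  extra_hosts:
    - "host.docker.internal:host-gateway"   # lets the container reach services on the host

node_exporter:
  image: prom/node-exporter:latest
  network_mode: host
  pid: "host"
  # Tell the exporter to read the host paths mounted below; without these flags the mounts are unused.
  command:
    - "--path.procfs=/host/proc"
    - "--path.sysfs=/host/sys"
    - "--path.rootfs=/rootfs"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro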
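
Once the stack is up, it is worth confirming that both scrape jobs are actually healthy, since a wrong target address only shows up as a "down" target. The sketch below reads the JSON returned by Prometheus's /api/v1/targets endpoint; the function names and the base_url default are assumptions for illustration, and fetch_targets needs a running Prometheus to be useful.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every active target not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            t.get("labels", {}).get("job", "?"),
            t.get("scrapeUrl", "?"),
            t.get("lastError", ""),
        )
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets listing from a running Prometheus instance."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Calling unhealthy_targets(fetch_targets()) on the VPS should return an empty list when both the node and flask-api jobs are being scraped.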

Week 17: Deploy Grafana, connect it to Prometheus as a data source, build dashboards:

  • System overview: CPU, memory, disk, network I/O
  • Flask API: request rate, error rate, latency (p50, p95, p99)
  • Alerts: Prometheus alerting rules routed through Alertmanager, emailing when disk usage > 80% or API error rate > 5%

Export your dashboards as JSON and commit them to the repo. This means the monitoring stack is reproducible.


Weeks 18–22: Kubernetes

Five weeks because the surface area is large and the concepts require time to internalize.

Week 18: Kubernetes architecture. Learn what a Node is, what a Pod is, what a Deployment is, and how the control plane works (API server, scheduler, controller manager, etcd). Install k3s on a local KVM VM (not on the VPS; k3s with multiple services will use too much RAM on the CX22):

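# k3s writes its kubeconfig readable by root only; mode 644 lets your user run kubectl without sudo.
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes
kubectl get pods -A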

Write Kubernetes manifests for the Flask API:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
      - name: flask-api
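        image: myapi:latest  # placeholder; k3s cannot see images in a local Docker daemon, so push to a registry or import the image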
        ports:
        - containerPort: 5000
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: database-url
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 30
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api
spec:
  selector:
    app: flask-api
  ports:
  - port: 80
    targetPort: 5000
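
To see what the requests and limits in the Deployment add up to across replicas, the quantities can be converted to plain numbers. This is a small sketch with hypothetical helper names; it handles only the suffixes used in the manifest above (m for millicores; Ki, Mi, Gi for memory), not the full Kubernetes quantity syntax.

```python
_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu_millicores(quantity):
    """'100m' -> 100, '1' -> 1000, '0.5' -> 500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_bytes(quantity):
    """'64Mi' -> 67108864; a bare number is taken as bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def deployment_totals(replicas, cpu, memory):
    """Total (millicores, bytes) across all replicas of one container spec."""
    return (
        replicas * parse_cpu_millicores(cpu),
        replicas * parse_memory_bytes(memory),
    )
```

For the manifest above, deployment_totals(2, "100m", "64Mi") gives the scheduler's reserved total for the two replicas, and the same call with the limits gives the worst case the node has to absorb.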

Weeks 19–20: ConfigMaps, Secrets, Ingress (using k3s's built-in Traefik ingress controller), PersistentVolumes for Postgres. Deploy the full multi-service stack to Kubernetes.

Weeks 21–22: Rolling updates, rollbacks (kubectl rollout undo deployment/flask-api), Horizontal Pod Autoscaler, namespace isolation, RBAC basics. Write a GitHub Actions workflow that builds a Docker image, pushes to Docker Hub or GitHub Container Registry, and updates the k3s deployment via kubectl set image.

Repository: k8s-fundamentals. All manifests in manifests/, a Kustomize or Helm intro in Week 22, and a README explaining architecture decisions.


Weeks 23–24: Consolidation

Audit all repos: every README answers what it runs, how to run it, what decisions were made, and what you'd do differently with more time. This last part ("what I'd do differently") is an advanced interview signal. Examples: "I'd add a proper secrets management solution (Vault or AWS Secrets Manager) instead of .env files", "I'd move Terraform state to a proper remote backend with team locking", "I'd add structured logging so error counts and latencies can be derived from logs and exposed as Prometheus metrics."

Write up a "portfolio narrative", a one-page document (not on GitHub, for your own use) that ties everything together: FactorSphere shows edge-native architecture and LLM integration; vps-multiservice shows traditional server ops; the Terraform repo shows IaC; the monitoring stack shows observability; the Kubernetes repo shows container orchestration. This is your answer to "walk me through your projects" in a senior technical round.




1. GitHub Repositories by Milestone

Week 4 Repositories

ops-scripts: Bash health check, backup, and log analyzer scripts; Python endpoint monitor, deploy check, Cloudflare deploy monitor, config validator, and VPS health reporter. Why a recruiter cares: it demonstrates ability to automate real operational tasks, the most common take-home test format. Scripts that run on real infrastructure (not mock data) are distinguishable immediately.

vps-multiservice: Docker Compose stack (Flask + Redis + Postgres), multi-stage Dockerfile, .env.example, Makefile with operational targets, systemd service file, and README documenting architecture decisions. Why a recruiter cares: it's a real multi-service production-style stack running on actual infrastructure, not a tutorial clone. The health check endpoint that checks dependency connectivity is a concrete signal of operational thinking.

factorsphere (existing, updated): docs/ARCHITECTURE.md and docs/CICD.md added. Why a recruiter cares: live product with real users transforms this from a class project into a production deployment story. The architecture documentation demonstrates that you understand what you built, not just that you ran commands.

Week 8 Repositories

vps-multiservice (updated): now includes GitHub Actions CI/CD workflow deploying to both VPS and EC2, Nginx reverse proxy configuration, CloudWatch log shipping, S3 backup integration, and updated README with the full AWS architecture. A badge shows the live CI/CD status. Why a recruiter cares: this is a complete junior DevOps portfolio project covering Docker, CI/CD, AWS, monitoring, and backups, the canonical skill set for a junior role.

ops-scripts (updated): S3 backup integration in backup.sh, CloudWatch metric publishing in vps-health-reporter.py, updated README. Why a recruiter cares: scripts that interact with real cloud services (boto3, AWS CLI) demonstrate practical cloud knowledge beyond "I know what S3 is."

Week 24 Repositories

infrastructure-iac: Terraform modules for the full AWS stack (EC2, VPC, subnets, security groups, IAM, S3). terraform apply from a clean account produces the entire vps-multiservice environment. Why a recruiter cares: IaC in any junior portfolio is unusual and signals maturity; Terraform specifically is the most common IaC tool in NCR DevOps JDs.

monitoring-stack: Prometheus, Grafana, and Alertmanager in Docker Compose, dashboard JSON files committed, Prometheus recording rules, Alertmanager config with email notifications. Why a recruiter cares: monitoring is the answer to "how do you know your service is healthy?", and most junior candidates can't demonstrate this.

k8s-fundamentals: Kubernetes manifests for the multi-service stack running on k3s, GitHub Actions pipeline that builds and deploys to the cluster, RBAC configuration, HPA configuration. Why a recruiter cares: Kubernetes is increasingly mentioned even in junior NCR DevOps JDs; a working cluster with manifests is proof of hands-on experience, not certification prep.


2. Job Titles and Application Timing

Apply now: Week 4

Titles: Junior DevOps Engineer, DevOps Engineer (0-2 years), Cloud Support Engineer, Infrastructure Engineer, Systems Engineer, Associate DevOps Engineer

Platforms: LinkedIn (primary; set alerts), Naukri.com (mandatory; service companies and GCCs post here exclusively), Instahyre, Wellfound

Company targets in NCR: Nagarro (Gurgaon), GlobalLogic (Noida), Publicis Sapient (Gurgaon), NIIT Technologies, Mphasis, smaller funded startups from Antler portfolio and other VC-backed companies. For service companies: HCL Technologies and Wipro have DevOps JDs that are genuine infrastructure work (not all of them; filter by the actual JD content, not just the title).

Direct career pages worth bookmarking: nagarro.com/careers, globallogic.com/careers, publicissapient.com/careers, mphasis.com/careers. These pages often have roles that don't appear on aggregators for 1–2 weeks after posting.

Apply at Week 8

You now have AWS concretely on your CV. Expand to: AWS Cloud Engineer (Junior), Cloud Infrastructure Engineer, Cloud Operations Engineer, DevOps Engineer roles that specify AWS in the JD.

New targets: GCCs that explicitly require AWS, such as Optum (Noida), Genpact Technology (Gurgaon), EXL Service, WNS, and Concentrix Technology. Apply via their career pages and Naukri simultaneously. Your profile is now meaningfully stronger: end-to-end AWS stack (EC2, IAM, S3, VPC, CloudWatch) with CI/CD and a live project to discuss.

Do not apply until Week 24

Platform Engineer: typically requires Kubernetes + Terraform + 2+ years. Platform teams at larger companies in NCR have a higher bar.

Senior DevOps Engineer: even at companies that post "1-3 years experience required", the actual expectation is 2+ years of real DevOps work. Applying earlier wastes your time.

DevSecOps Engineer: security tooling (Vault, Snyk, SAST/DAST pipelines) is a domain requiring deliberate study not on this roadmap.

Cloud Architect: requires 5+ years and cert-level AWS knowledge.


3. What Gets Tested in NCR DevOps Interviews

Phone screen / HR call

You will be asked: your experience summary (have a 90-second version), which tools you've used (answer specifically: not "I know Docker", but "I've written multi-stage Dockerfiles and run Compose stacks on a VPS"), availability and notice period (fresh grad = immediate joining), salary expectation (state your target, not your floor; see Section 5), why DevOps specifically.

They are screening for: articulate communication, genuine experience (not just listed on CV), salary fit, and basic logical thinking. They are not testing technical knowledge here.

Technical round

Linux questions that actually appear:

  • "What does the first column of ps aux output mean?" (process owner)
  • "How do you find which process is using port 8080?" (ss -tlnp | grep 8080 or lsof -i :8080)
  • "What's the difference between kill -9 and kill -15?" (SIGKILL vs SIGTERM: SIGKILL ends the process immediately and cannot be caught; SIGTERM asks it to shut down gracefully and can be handled)
  • "A service is failing to start. Walk me through your diagnosis." (systemctl status, journalctl -u servicename -n 50, check the binary exists and permissions are correct, check the port isn't already in use)
  • "What is load average?" (on Linux, the average number of processes running, waiting for CPU, or in uninterruptible I/O wait over the last 1, 5, and 15 minutes; read it relative to the number of CPU cores)
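
The "relative to CPU cores" part of the load average answer is easy to show in a few lines. This is a hedged sketch: the function names are hypothetical and the 0.7 and 1.0 thresholds are a common rule of thumb, not a standard.

```python
import os


def load_per_core(load_avg, cores):
    """Normalise a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def describe_load(load_avg, cores):
    """Rough label for a load figure (thresholds are assumptions)."""
    ratio = load_per_core(load_avg, cores)
    if ratio < 0.7:
        return "comfortable"
    if ratio <= 1.0:
        return "busy"
    return "saturated"


def current_status():
    """Describe the 1-minute load of the machine this runs on (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return describe_load(one_minute, os.cpu_count() or 1)
```

A load of 2.0 means very different things on a 1-core VPS and on an 8-core host, which is the point an interviewer is usually probing for.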

Docker questions that actually appear:

  • "What's the difference between CMD and ENTRYPOINT in a Dockerfile?" (ENTRYPOINT sets the executable and changes only if you pass --entrypoint; CMD supplies the default command, or default arguments to ENTRYPOINT when both are set, and is replaced by any arguments given to docker run)
  • "How do containers in a Docker Compose network resolve each other?" (by service name; Docker provides built-in DNS for user-defined networks)
  • "What's a multi-stage build and why use it?" (multiple FROM instructions, with only the final stage ending up in the image; this keeps build tools out of the production image and reduces size and attack surface)
  • "How do you make a Docker container restart automatically?" (--restart unless-stopped or restart: unless-stopped in Compose)

Networking questions that actually appear:

  • "Walk me through what happens when a user visits a URL" (DNS lookup → TCP connection → TLS handshake → HTTP request → response)
  • "What's the difference between a reverse proxy and a load balancer?" (a reverse proxy sits in front of the backend and hides it; a load balancer distributes requests across multiple backends; Nginx can do both)
  • "What does a 502 error mean?" (bad gateway: the proxy received an invalid response from the upstream server)
  • "How does SSH authentication work?" (with key-based auth, the client offers a public key; if the server finds it in authorized_keys, the client proves it holds the matching private key by signing data tied to the session, and the server verifies that signature with the stored public key)

Scripting:

  • You may be asked to write a Bash or Python script live. Common formats: "write a script that checks if these services are running and restarts any that aren't", "write a function that parses this log file and counts occurrences of each status code"
  • The evaluator is checking: do you use proper error handling (set -euo pipefail in bash, try/except in Python), do you use functions, do you write readable code
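
As an example of the log-parsing format mentioned above, here is a minimal sketch that counts status codes. It assumes nginx or Apache common/combined log format, where the status follows the quoted request line; the function name and the regular expression are illustrative choices, and malformed lines are skipped without raising.

```python
import re
from collections import Counter

# The status code is the 3-digit field right after the quoted request,
# e.g. ... "GET /health HTTP/1.1" 200 612 ...
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_codes(lines):
    """Count occurrences of each HTTP status code in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Because the function accepts any iterable of lines, it works on an open file handle (with open("/var/log/nginx/access.log") as f: count_status_codes(f)) as well as on a list in a test.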

Practical / take-home task

Common formats (2–4 hours):

  • "Write a Dockerfile for this Python app, a Docker Compose file that adds Redis, and a health check endpoint" (you'll have a GitHub repo to fork)
  • "Write a GitHub Actions workflow that runs tests and deploys to a remote server on push to main" (they provide fake SSH credentials or ask you to describe what the secrets would contain)
  • "Write a Python script that monitors these endpoints and emails you if any return non-200" (tests requests, SMTP, argparse)
  • "Debug this broken docker-compose.yml" (they give you a file with 3-5 intentional errors)
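
For the endpoint-monitor task, one possible structure is sketched below. It uses urllib from the standard library instead of requests so that it runs with no extra packages; the function names, the SMTP host and port defaults, and the message wording are all assumptions. Passing the fetch function in as a parameter keeps the check logic testable without a network.

```python
import smtplib
from email.message import EmailMessage
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url, timeout=5):
    """Return the HTTP status code, or None if the endpoint is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code
    except (URLError, OSError):
        return None


def find_failures(urls, fetch=fetch_status):
    """Map each URL that did not return 200 to the status it returned."""
    failures = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures


def build_alert(failures, sender, recipient):
    """Build the alert email listing every failing endpoint."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(failures)} endpoint(s) failing"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{url}: {status}" for url, status in failures.items()))
    return msg


def send_alert(msg, host="localhost", port=25):
    """Send the alert through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

In a submission, an argparse front end would take the URL list and recipient, call find_failures, and call send_alert only when the result is non-empty.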

What separates good submissions: health checks are included, .env.example is present, README explains what it does and how to run it, commit history shows incremental work (not one giant commit), you handle error cases explicitly.


4. Common Mistakes That Prevent Freshers From Getting DevOps Jobs in India

Listing technologies you can't defend. If "Kubernetes" is on your CV because you ran kubectl get pods once in a tutorial, an interviewer asking "how does a rolling update work?" will end the conversation. Every technology on your CV must have a project behind it and a clear verbal answer to "tell me how you used this."

GitHub repos without READMEs. A recruiter spending 30 seconds on a repo with no README closes the tab. A repo called docker-practice with three files and no explanation tells a hiring manager nothing about your ability to operate infrastructure. Every repo needs to explain what it runs, why it exists, and what decisions were made.

Applying to renamed helpdesk roles. Many "DevOps Engineer" JDs at service companies are L1 support with a rebranded title. Read the JD carefully; red flags include: "ITIL certification preferred", "ticketing system experience required", "incident management", no mention of Docker/AWS/Linux in the technical requirements. These roles do not build the skills you need and often pay below the floor you've set.

Not knowing your own project in depth. "I deployed a multi-service Docker Compose stack" is not an answer. "I deployed a Flask API with Redis and Postgres in Docker Compose, using multi-stage builds to reduce image size from 1.1GB to 180MB, with health checks so Postgres is confirmed ready before the API starts, running behind Nginx with SSL termination, and automated via GitHub Actions" is an answer. Practice this verbally, not just in your head.

Treating salary floor as the opening number. If you say "I'm looking for around ₹35,000" at a startup that would pay ₹55,000 for your profile, you've lost ₹20,000/month permanently. Know the market rates (see Section 5), start at your target ceiling for each company tier, not your floor.

One-commit GitHub histories. A repo where all the work appears in a single commit titled "add project files" signals that you copied files over rather than built incrementally. Commit as you work. The commit history is evidence of your process.

Not applying early enough. The job search takes time independent of your preparation level. Candidates who start applying at Week 4 get their first offers at Week 8–12. Candidates who wait until they feel "ready" start at Week 8 and get their first offers at Week 14–18. The feedback from real rejections improves your interview performance faster than another week of studying.

Inconsistent positioning. Applying to a DevOps role and then mentioning in the interview that you also do React or full-stack work signals that you don't actually want a DevOps role. Decide on the role type and be consistent in every touch point β€” resume, LinkedIn, outreach, interview answers.

Vague answers to "tell me about your experience." "I worked at an Antler-backed company on various projects" is vague. "I was a software developer intern at a venture studio backed by Antler, where I shipped production SaaS MVPs across multiple projects. I owned Docker configuration and CI/CD pipelines across three projects, integrated with external APIs, and worked with distributed teams across time zones" is not vague.

Sending generic outreach. A LinkedIn message that could have been sent to anyone ("I am a passionate DevOps professional seeking opportunities") gets ignored. One that references the company's actual stack or a specific JD they posted gets responses. Take 3 minutes per message to make it specific.


5. Realistic Salary Ranges

NCR service companies (TCS, HCL, Wipro, Infosys, Tech Mahindra)

Range: ₹3.5–5 LPA (CTC). Take-home is ~70–75% of CTC after PF, tax, insurance deductions. At ₹4.5 LPA CTC, take-home is approximately ₹28,000–31,000/month.

Your profile puts you toward the upper end of the fresher band, but service companies have fixed fresher slabs that don't move much for individual profile quality. The internship experience may place you in the "experienced fresher" band at some companies (₹4.5–5.5 LPA) versus the "campus fresher" band (₹3.5–4 LPA).

Honest assessment: these roles are the right floor for negotiation, not the target. The actual work at L1/L2 entry in service companies is often not genuine infrastructure. The learning environment is slower. Treat these as the fallback, not the goal.

Skills that move you up within this band: AWS certification (not worth getting for this tier, though), ITIL knowledge (not worth learning for this tier either). Don't optimize for this band.

NCR funded startups (Series A–C, Antler portfolio)

Range: ₹5–8 LPA for a genuine junior DevOps role. At ₹6 LPA, take-home is approximately ₹42,000–45,000/month depending on structure.

Your Antler studio connection is a direct advantage here. Antler portfolio companies know what their studio interns produce, so the signal is stronger than a cold application. FactorSphere as a live production product with real users is unusual for a fresher; most candidates at this stage have tutorial clones.

Your realistic target at a well-funded startup: ₹6–7 LPA at Week 4, negotiable to ₹7–8 LPA after Week 8 with AWS on the CV.

Skills that move you up this band: AWS working knowledge (Week 8), ability to own deployment pipelines end-to-end without supervision, Docker and Compose at production level (Week 4), Python scripting that actually runs in their infrastructure.

Product companies and GCCs (Nagarro, GlobalLogic, Publicis Sapient, Optum, Genpact Tech, EXL)

Range: ₹6–10 LPA for cloud/DevOps roles. At ₹8 LPA, take-home is approximately ₹55,000–60,000/month.

GCCs pay more than domestic companies for equivalent work because they're paying against a global compensation benchmark. The tradeoff is a higher technical bar at screening and more structured interview processes.

Your profile is competitive here after Week 8 (AWS added). Before Week 8, your serverless and Cloudflare experience is harder to map to what GCC interviewers are looking for; after Week 8, the AWS + Docker + CI/CD story is a clean match.

Realistic target at a GCC: ₹7–9 LPA at Week 8, with AWS and a demonstrated CI/CD project.

Skills that move you up this band: specific AWS service depth (beyond EC2/S3: RDS, ECS, ECR, CodePipeline), monitoring/observability (Prometheus or CloudWatch), IaC awareness (Terraform in Phase 3).

Your Week 4 realistic ceiling (before AWS): ₹6–7 LPA at a funded startup or mid-size product company where your FactorSphere + Docker + CI/CD story lands well.

Your Week 8 realistic ceiling (after AWS): ₹8–9 LPA at a GCC or strong product company.

Your floor (non-negotiable): ₹35,000/month = ~₹4.2 LPA. Achievable at service companies. Don't accept less; it's below market even at service companies for a profile with live production experience.
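
The floor conversion above is plain arithmetic, and the same helpers make it easy to compare offers quoted per month against bands quoted in LPA. A small sketch with hypothetical function names; it treats the monthly figure as CTC divided by 12, before any deductions.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees


def lpa_to_monthly(lpa):
    """Annual CTC in lakhs -> gross rupees per month (before deductions)."""
    return round(lpa * LAKH / 12)


def monthly_to_lpa(monthly_rupees):
    """Gross rupees per month -> annual figure in lakhs."""
    return round(monthly_rupees * 12 / LAKH, 2)


def annual_gap_lpa(asked_monthly, possible_monthly):
    """Yearly cost, in lakhs, of asking for less than the employer would pay."""
    return monthly_to_lpa(possible_monthly - asked_monthly)
```

For example, monthly_to_lpa(35_000) reproduces the floor figure, and annual_gap_lpa(35_000, 55_000) puts a yearly number on the under-asking example from Section 4.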

Negotiation: When a recruiter asks for your expectation, say the target number for that company tier, not your floor. "I'm looking for ₹6–7 LPA, based on my production experience and the market for junior DevOps roles in NCR." If they push back, ask what the band is before accepting or declining.


6. Honest Assessment of the Week 4 Target

Week 4 interview-readiness for junior DevOps roles at startups and mid-size product companies in NCR is realistic. This is not inflated encouragement; it's based on what your profile actually produces by Week 4:

You arrive at Week 4 with: Python competency, Linux daily driver experience, Docker from internship, Git fluency, a live production product with real users, and production CI/CD experience. These are not nothing. Most fresher DevOps candidates have none of the internship context and only classroom exposure.

By Week 4 you add: ops-level Linux (systemd, SSH hardening, process management, log analysis), practical networking (DNS, HTTP, SSH tunneling, firewalls), DevOps Python scripting (subprocess, requests, YAML/JSON, proper CLIs), and production Docker + Compose (multi-stage builds, health checks, named volumes, systemd-managed stacks).

This is enough to pass a phone screen and a standard junior technical round at a startup or mid-tier product company. It is not enough to pass a rigorous GCC technical screen that probes AWS depth; that readiness comes after Week 8.

Most likely blocker: The gap between knowing how something works and being able to explain it under mild interview pressure. Docker networking questions specifically ("a container in service A can't reach service B, what do you check?") require not just knowing the answer but being comfortable walking through it out loud. This gap closes with deliberate verbal practice, not with more studying. Spend 30 minutes every day of Week 4 talking through your projects as if you're in an interview. Record yourself if you can; the first playback will tell you exactly what to fix.

Second most likely blocker: Sparse GitHub repos that don't match your verbal claims. If you say "I built a multi-service Docker Compose stack on a VPS" and the recruiter opens the repo to find three files and no README, you've undermined your own story. Prioritize clean, documented repos with meaningful commit history over adding more features.

Week 6–8 for first offer: Achievable if you apply at 10+ per week starting Week 4 and follow up on applications actively. Realistic for the right startup or product company.

Week 8–10 for first offer: The more likely outcome for most candidates, accounting for interview scheduling delays (common in India), HR processes, and the normal distribution of fit between your profile and open roles. This is not a failure case; it's the median outcome for a candidate with your profile executing this plan.

Week 6–8 for GCC offer: Unlikely. GCC processes in NCR typically take 4–6 weeks from application to offer even when you pass every round. Apply to GCCs at Week 8 and expect the offer to come at Week 12–14.

The 4-week interview-readiness target is sound. The 6–8 week first offer target is optimistic but achievable. The 8–10 week target is realistic without being pessimistic. The answer to "should I start applying at Week 4 even if I don't feel ready?" is yes, unambiguously.

Preserved source markdown
# DevOps Career Roadmap: NCR, 2025–26

---

## DSA: Answered Directly

**Do NCR DevOps, Cloud Support, and Infrastructure roles screen for LeetCode-style DSA?**

At funded startups targeting DevOps specifically: no. The technical screen is scripting (Python or Bash), Linux troubleshooting scenarios, and Docker/Compose tasks. No arrays, graphs, or dynamic programming.

At product companies and GCCs: mostly no, but with a real exception. Some GCCs (Optum, Genpact Tech, Publicis Sapient) route all applicants through a standardized first-round online assessment that includes basic coding: not algorithmic complexity, but "write a function that does X" type problems a second-year CS student should handle. If you apply to these through their career portal (not via referral or recruiter), you may hit this screen. The coding is at the level of "reverse a string without slicing" or "count words in a paragraph", not LeetCode mediums. If you've written Python for two years, you're fine. No prep needed.

At service companies (TCS, Wipro, HCL, Infosys): their entry tests include aptitude + basic coding. The coding section is trivial CS101 material, not competitive programming. If this is the blocker, 2 hours of practice on basic Python problems clears it.

**Conclusion:** No LeetCode prep is needed. No LeetCode appears in this roadmap under any framing. The only coding you'll write as interview prep is ops-relevant Python scripting, which is on the roadmap as a skill anyway.

---

## Technology Rationale Table

| Technology | Immediate Employability Impact | Learnable on the Job? | Why It's Here |
|---|---|---|---|
| Linux (ops-specific) | High | Partially | Every VPS, EC2, and container runs Linux; ops-level commands are tested directly in technical rounds |
| Git (infra conventions) | Medium | Yes | Ops repos, GitOps patterns, and READMEs are how hiring managers evaluate your actual work product |
| Networking | High | Partially | DNS, SSH, firewall, and HTTP debugging appear explicitly in junior DevOps technical screens in NCR |
| Python scripting | High | Partially | Automation scripts are the most common take-home task format; real scripts separate you from CV-padders |
| Docker | High | No | In nearly every junior DevOps JD in NCR; multi-stage builds and Compose are now baseline expectations |
| Docker Compose | High | Partially | Multi-service Compose stacks are the standard deployment unit at companies without Kubernetes |
| Nginx | High | Partially | Reverse proxy knowledge is expected at junior level; SSL termination and upstreams come up in interviews |
| GitHub Actions | High | Yes | CI/CD is table stakes; VPS deploy pipelines are more transferable than Cloudflare-specific ones |
| AWS EC2/IAM/S3/VPC | High | Partially | Most NCR product companies and GCCs run on AWS; differentiates you from candidates who only know on-prem |
| CloudWatch | Medium | Yes | Logging and basic monitoring show operational awareness; AWS-native so low friction to learn |
| Terraform | Medium | Yes | Increasingly in mid-level JDs; rarely a hard gate at junior level in NCR, but Phase 3 depth pays off post-hire |
| Prometheus | Medium | Yes | Monitoring stack is a bonus at junior level; signals genuine ops thinking beyond just deployment |
| Grafana | Medium | Yes | Dashboard work demonstrates operational maturity; trivial to add once Prometheus is running |
| Kubernetes | Medium | No | Appearing in NCR JDs even at junior level now; large surface area justifies Phase 3 placement |

---

## PHASE 1: Weeks 1–4 - Become Employable

---

### Week 1 - Linux (Ops-Specific) + Git (Infra Conventions)

**What this week unlocks:** Your VPS becomes a real work environment, not a toy. You can answer Linux troubleshooting questions in interviews. FactorSphere gets professional documentation that turns a live project into an interview story.

**Study hours: 42**

**Learning objectives:**
- Ops-level networking stack: reading active connections, capturing traffic, diagnosing from the command line
- SSH hardening: key-only auth, non-root user, config file discipline
- systemd: writing, enabling, and managing service units from scratch
- Bash scripting for automation: health checks, backups, log analysis
- Process and resource management at ops level
- Log analysis with journalctl and standard log tooling
- How infrastructure repos differ from application repos; writing READMEs a hiring manager actually reads
- Professional documentation for FactorSphere CI/CD pipeline and architecture

**Technologies:** Linux (Ubuntu 22.04 on Hetzner VPS), Bash, systemd, UFW, SSH, journalctl, Git

---

**Monday–Tuesday: VPS baseline and network stack**

SSH into the Hetzner VPS. If you're logging in as root, fix that first.

```bash
adduser deploy
usermod -aG sudo deploy
```

SSH hardening. Edit `/etc/ssh/sshd_config`:
```
PasswordAuthentication no
PermitRootLogin no
AllowUsers deploy
```

Copy your public key to the new user:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519.pub deploy@<vps-ip>
```

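If you don't have a key pair yet, generate one on your local machine before running `ssh-copy-id`: `ssh-keygen -t ed25519 -C "vps-deploy-key"`

Validate the config with `sshd -t`, then restart sshd: `systemctl restart sshd`. Confirm the effective settings with `sshd -T | grep -Ei 'passwordauthentication|permitrootlogin'`, because a drop-in file under `/etc/ssh/sshd_config.d/` can override the main config. Verify you can log in as `deploy` from a second terminal before ending the root session.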

UFW setup:
```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow from <your-home-ip> to any port 22
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
ufw status verbose
```

Networking stack commands. Run these and understand every line of output:
```bash
ip addr show
ip route show
ss -tlnp          # listening TCP sockets with process names
ss -tulnp         # TCP + UDP
netstat -tlnp     # older (needs the net-tools package) but still appears in interviews
```

Install tcpdump: `apt install tcpdump`. Run:
```bash
tcpdump -i eth0 port 22   # watch your own SSH session
tcpdump -i eth0 port 80 -A  # see HTTP traffic in ASCII
```

You're not becoming a packet analysis expert. You're learning to answer "how would you debug a connectivity issue" with real commands.

---

**Wednesday: systemd**

Write a real systemd service. Create a minimal Python HTTP server first:

```python
# /opt/healthapi/server.py
import http.server, json
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())
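    # The default log_message writes each request to stderr, which systemd
    # forwards to the journal, so `journalctl -u healthapi` shows real traffic.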

http.server.HTTPServer(('', 8080), Handler).serve_forever()
```

Service unit `/etc/systemd/system/healthapi.service`:
```ini
[Unit]
Description=Health API
After=network.target

[Service]
Type=simple
User=deploy
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

```bash
systemctl daemon-reload
systemctl enable healthapi
systemctl start healthapi
systemctl status healthapi
curl http://localhost:8080
journalctl -u healthapi -f       # tail logs
journalctl -u healthapi --since "1 hour ago"
journalctl -p err -u healthapi   # errors only
```

Stop the service, break it intentionally (wrong path), observe `systemctl status` output. This is what debugging looks like.

---

**Thursday: Bash scripting for automation**

Write three scripts. These go on GitHub. They are real deliverables, not exercises.

`health-check.sh` (VPS health report):
```bash
#!/bin/bash
set -euo pipefail

THRESHOLD_DISK=80
THRESHOLD_MEM=90

echo "=== VPS Health Check $(date) ==="

# Disk
DISK_USE=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
echo "Disk usage: ${DISK_USE}%"
if [ "$DISK_USE" -gt "$THRESHOLD_DISK" ]; then
    echo "WARNING: disk above ${THRESHOLD_DISK}%" >&2
fi

# Memory
MEM_TOTAL=$(free -m | awk '/Mem:/ {print $2}')
MEM_USED=$(free -m | awk '/Mem:/ {print $3}')
MEM_PCT=$(( MEM_USED * 100 / MEM_TOTAL ))
echo "Memory usage: ${MEM_PCT}% (${MEM_USED}/${MEM_TOTAL} MB)"
if [ "$MEM_PCT" -gt "$THRESHOLD_MEM" ]; then
    echo "WARNING: memory above ${THRESHOLD_MEM}%" >&2
fi

# Services: edit list for your environment
for SVC in healthapi nginx docker; do
    # is-active exits non-zero for anything but "active"; || true keeps set -e from aborting
    STATUS=$(systemctl is-active "$SVC" 2>/dev/null || true)
    echo "Service $SVC: ${STATUS:-unknown}"
done

# Open ports
echo "Listening ports:"
ss -tlnp | awk 'NR>1 {print $1, $4, $6}'
```

`backup.sh` (timestamped archive):
```bash
#!/bin/bash
set -euo pipefail

SRC="${1:?Usage: backup.sh <source-dir>}"
DEST="/var/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
ARCHIVE="${DEST}/backup_${TIMESTAMP}.tar.gz"

mkdir -p "$DEST"
tar czf "$ARCHIVE" "$SRC"
echo "Backup created: $ARCHIVE ($(du -sh "$ARCHIVE" | cut -f1))"

# Keep last 7 backups
ls -t "${DEST}"/backup_*.tar.gz | tail -n +8 | xargs -r rm
echo "Old backups pruned. Current count: $(ls "${DEST}"/backup_*.tar.gz | wc -l)"
```

`log-analyzer.sh` (parse journalctl for errors in the last N hours):
```bash
#!/bin/bash
HOURS="${1:-24}"
echo "=== Error summary: last ${HOURS} hours ==="
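# Field 5 is "program[pid]:"; strip the PID so counts group by program
journalctl --since "${HOURS} hours ago" -p err --no-pager | \
    awk '{print $5}' | sed -E 's/\[[0-9]+\]:?$//; s/:$//' | sort | uniq -c | sort -rn | head -20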
```

Make them executable (`chmod +x`), test them, commit them with a meaningful message: `feat(scripts): add VPS health check, backup, and log analysis scripts`.

---

**Friday–Saturday: Process management, disk, LVM awareness, log deep dive**

Process management:
```bash
ps aux --sort=-%cpu | head -10    # top CPU consumers
ps aux --sort=-%mem | head -10    # top memory consumers
kill -9 <pid>                     # SIGKILL
kill -15 <pid>                    # SIGTERM (graceful)
nice -n 10 <command>              # lower priority
renice -n 5 -p <pid>              # change running process priority
```

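`top` and `htop`: know what load average means. It is the average number of processes running or waiting to run (on Linux it also counts processes in uninterruptible I/O wait). A load average of 1.0 on a single-core machine means the CPU is fully occupied; on a 4-core machine, 1.0 is roughly 25% of capacity. This comes up in interviews.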
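The per-core arithmetic above can be sketched in a few lines. This is a minimal illustration, not a monitoring tool; the helper name `load_per_core` is made up for this example.

```python
import os

def load_per_core(load_avg: float, cores: int) -> float:
    """Normalise a load average by core count (1.0 means every core is busy)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores

# The two cases from the text: the same load of 1.0 on 1 core and on 4 cores.
print(f"{load_per_core(1.0, 1):.0%}")  # 100%
print(f"{load_per_core(1.0, 4):.0%}")  # 25%

if __name__ == "__main__":
    # Live numbers for this machine (os.getloadavg is available on Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load {one_min:.2f} on {cores} cores = {load_per_core(one_min, cores):.0%}")
```

A sustained value above 100% means work is queuing for CPU (or, on Linux, waiting on I/O).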

Disk:
```bash
df -h                  # filesystem usage
du -sh /var/*          # directory sizes
lsblk                  # block devices
fdisk -l               # partition table
```

LVM: you may not have LVM on the VPS, but understand the commands:
```bash
pvs    # physical volumes
vgs    # volume groups
lvs    # logical volumes
```

Log analysis:
```bash
journalctl --since "2025-01-01" --until "2025-01-02"
grep -c ' 502 ' /var/log/nginx/access.log   # count 502s (nginx access logs go to files by default, not the journal)
grep -E "ERROR|WARN" /var/log/syslog | tail -50
zcat /var/log/syslog.2.gz | grep ERROR   # compressed log files
```

---

**Sunday: Git infra conventions + FactorSphere documentation**

This is the most interview-impactful work of the week.

Create `docs/` in the FactorSphere repo. Write two files:

**`ARCHITECTURE.md`**: cover why Cloudflare Workers instead of a traditional server (latency, minimal cold-start overhead at the edge, cost at zero users), how requests flow (DNS → Cloudflare edge → Worker → Pinecone/LLM → response), data flow for the ranking pipeline (source aggregation → processing Workers → Pinecone indexing → query Workers), why Pinecone over a hosted Postgres vector extension (managed, no infra to maintain), how the frontend on Cloudflare Pages is decoupled from the Worker API, and what the tradeoffs are (no persistent state in Workers, Pinecone vendor lock-in, cold start behavior). Use a Mermaid diagram:

```mermaid
graph LR
  User --> CF_Edge[Cloudflare Edge]
  CF_Edge --> Worker_API[Workers API]
  Worker_API --> Pinecone[(Pinecone Vector DB)]
  Worker_API --> LLM[LLM Inference]
  CF_Pages[Cloudflare Pages frontend] --> Worker_API
```

**`CICD.md`**: cover what triggers a deploy (push to `main`), the GitHub Actions workflow steps (lint → type check → Wrangler deploy), what `wrangler deploy` actually does (bundles the Worker, uploads to Cloudflare's edge network), how Cloudflare Pages handles frontend deploys automatically (build hook on push), what happens on failure (GitHub Actions marks the workflow run as failed, Wrangler does not promote the broken version, and the previous version stays live), how to roll back (revert commit + push, or `wrangler rollback` to a previous deployment ID), and what environment variables are injected at deploy time vs stored as Cloudflare secrets.

Write this so you can recite it verbally in a 3-minute interview answer. That's the test.

**Infra repo conventions:**
- READMEs in infrastructure repos answer: what this runs, how to run it, what environment variables it needs, what the architecture looks like, and what decisions were made and why
- Application repos explain features; infrastructure repos explain operations
- Commit messages in infra repos are imperative and specific: `fix(nginx): increase worker_connections to handle spike load` not `update config`
- Store `.env.example` with all variable names and no values. Never commit `.env`.
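The commit-message convention above can be checked mechanically. The sketch below is one possible check, assuming a `type(scope): summary` shape; the list of allowed types and the name `is_conventional` are illustrative choices, not a standard this roadmap prescribes.

```python
import re

# Assumed shape: type(scope): summary. The allowed types are an illustrative list.
COMMIT_PATTERN = re.compile(
    r"^(feat|fix|refactor|docs|chore|ci|test)\([a-z0-9_-]+\): \S.*$"
)

def is_conventional(message: str) -> bool:
    """Check only the first line of the commit message against the pattern."""
    first_line = message.splitlines()[0] if message else ""
    return bool(COMMIT_PATTERN.match(first_line))

for msg in (
    "fix(nginx): increase worker_connections to handle spike load",
    "update config",
):
    print(f"{'ok ' if is_conventional(msg) else 'bad'} {msg}")
```

A check like this could run as a `commit-msg` Git hook, though reviewing your own history by eye works just as well for a personal repo.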

**Deliverables, Week 1:**
- Hetzner VPS: non-root `deploy` user, SSH key-only auth, UFW configured, `healthapi` systemd service running
- GitHub repo `ops-scripts`: `health-check.sh`, `backup.sh`, `log-analyzer.sh` with meaningful commits and a README
- FactorSphere repo: `docs/ARCHITECTURE.md` and `docs/CICD.md` β€” professional, interview-ready

---

### Week 2 - Networking (Practical)

**What this week unlocks:** You can answer every networking question in a junior DevOps technical round. DNS debugging, HTTP troubleshooting, SSH advanced usage, and firewall diagnosis are the most common technical screen topics. This week makes you competent at all of them.

**Study hours: 42**

**Learning objectives:**
- DNS: resolution chain, record types, TTL, cache behavior, practical debugging tools
- HTTP: headers, status codes, TLS handshake, what curl reveals
- TCP/IP: three-way handshake, port states, socket inspection
- SSH: config file, tunneling, port forwarding, agent forwarding
- Firewalls: UFW rule management, iptables literacy, nftables awareness
- tcpdump for real diagnosis

---

**Monday–Tuesday: DNS**

Install: `apt install dnsutils` on the VPS if not present.

```bash
dig factorsphere.org                    # full answer section
dig +short factorsphere.org             # just the IP
dig +trace factorsphere.org             # full resolution chain from root
dig @8.8.8.8 factorsphere.org           # force a specific resolver
dig @1.1.1.1 factorsphere.org           # Cloudflare resolver
dig MX factorsphere.org                 # mail records
dig TXT factorsphere.org                # SPF, DKIM, verification records
dig CNAME www.factorsphere.org          # canonical name
dig -x <ip-address>                     # reverse lookup
```

Read the AUTHORITY SECTION and ADDITIONAL SECTION in `dig` output. Understand what TTL means: resolvers cache a record for its TTL, so after you change a DNS record, clients behind a resolver that cached the old value keep receiving it until that TTL expires. This is how to answer "how long does DNS propagation take?"

Watch DNS queries in real time:
```bash
tcpdump -i any port 53 -n
# Open another terminal and run: dig google.com
# Watch the query and response packets appear
```

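Understand `/etc/resolv.conf` (which nameserver your system queries) and `/etc/hosts` (local override, checked before DNS). On Ubuntu 22.04, `/etc/resolv.conf` usually points at the systemd-resolved stub (`127.0.0.53`); `resolvectl status` shows the upstream servers. Add an entry to `/etc/hosts` that maps a fake hostname to localhost, verify it resolves, then remove it.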
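To make the two file formats concrete, here is a small sketch that parses sample contents of each. The sample text and the function names (`parse_nameservers`, `lookup_hosts`) are invented for illustration; on a real machine you would read the files themselves.

```python
from typing import List, Optional

def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return the IPs on 'nameserver' lines, ignoring comments and other options."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

def lookup_hosts(hosts_text: str, name: str) -> Optional[str]:
    """Return the first IP that /etc/hosts-style text maps to the given name."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None

# Hypothetical file contents for demonstration.
SAMPLE_RESOLV = """\
# managed by systemd-resolved
nameserver 127.0.0.53
options edns0 trust-ad
"""
SAMPLE_HOSTS = """\
127.0.0.1 localhost
127.0.0.1 fake.internal.test   # temporary test entry
"""

print(parse_nameservers(SAMPLE_RESOLV))
print(lookup_hosts(SAMPLE_HOSTS, "fake.internal.test"))
```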

Configure a real subdomain if you have a domain: point `vps.yourdomain.com` to your Hetzner IP as an A record. Verify with `dig`. This demonstrates you've actually managed DNS, not just read about it.

---

**Wednesday: HTTP in depth**

```bash
curl -v https://factorsphere.org            # verbose: see TLS handshake, headers, body
curl -I https://factorsphere.org            # HEAD request only, no body
curl -L https://factorsphere.org            # follow redirects
curl -X POST https://api.example.com/endpoint \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"key": "value"}'
curl -o /dev/null -s -w "%{http_code}\n" https://factorsphere.org  # just status code
```

Read every header in the `curl -v` output. Know what these mean:
- `Cache-Control: max-age=86400`: browser can cache for 24 hours
- `X-Forwarded-For`: original client IP when behind a proxy (a request header added by proxies, so look for it in server logs rather than in the response)
- `Strict-Transport-Security`: tells browsers to use HTTPS only for this host
- `CF-RAY`: Cloudflare request ID, useful for debugging Workers
- `Content-Encoding: gzip`: response is compressed

HTTP status codes you must know cold: 200, 201, 204, 301, 302, 304, 400, 401, 403, 404, 422, 429, 500, 502, 503, 504. Know the difference between 401 and 403, between 502 and 503.
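One way to drill the confusable pairs is a small lookup that prints the class and a one-line note. The notes below are short paraphrases written for this sketch, and `describe` is a hypothetical helper; the reason phrases come from Python's standard `http.HTTPStatus`.

```python
from http import HTTPStatus

# Short interview-style notes for the pairs that are easy to confuse.
NOTES = {
    401: "no valid credentials presented; authenticating may fix it",
    403: "identity is known but not allowed; re-authenticating will not help",
    502: "a proxy received an invalid response from the upstream server",
    503: "the server cannot handle the request right now (overload or maintenance)",
    504: "a proxy timed out waiting for the upstream server",
}

CLASSES = {1: "informational", 2: "success", 3: "redirection",
           4: "client error", 5: "server error"}

def describe(code: int) -> str:
    """Return 'code phrase [class]' plus a note when one is defined."""
    line = f"{code} {HTTPStatus(code).phrase} [{CLASSES[code // 100]}]"
    if code in NOTES:
        line += f": {NOTES[code]}"
    return line

for code in (401, 403, 502, 503):
    print(describe(code))
```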

TLS handshake: be able to describe client hello → server hello + certificate → client verifies cert → key exchange → symmetric session established. Not cryptography depth, but the sequence.

```bash
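openssl s_client -connect factorsphere.org:443 -servername factorsphere.org </dev/null    # inspect the TLS certificate
openssl s_client -connect factorsphere.org:443 -servername factorsphere.org </dev/null 2>/dev/null \
  | openssl x509 -noout -dates                   # check cert expiry (notBefore / notAfter)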
```
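The `notAfter` value printed by `openssl x509 -dates` can be turned into "days remaining" with the standard library. This is a minimal sketch; `days_until_expiry` is a name chosen for this example, and the date string is a sample value, not a real certificate's.

```python
import ssl
import time

SECONDS_PER_DAY = 86400

def days_until_expiry(not_after: str, now_ts: float) -> int:
    """Whole days between now_ts (Unix seconds) and a cert notAfter string."""
    # cert_time_to_seconds expects the "Mon DD HH:MM:SS YYYY GMT" format
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    return int((expiry_ts - now_ts) // SECONDS_PER_DAY)

# Sample notAfter string (the part after "notAfter=" in openssl's output).
sample = "Jan  5 09:34:43 2018 GMT"
print(days_until_expiry(sample, time.time()))  # negative: this sample date is in the past
```

A negative result means the certificate has already expired; a monitoring script would typically warn below some threshold such as 30 days.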

---

**Thursday: SSH advanced**

Create `~/.ssh/config`:
```
Host vps
    HostName <your-vps-ip>
    User deploy
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60

Host github.com
    IdentityFile ~/.ssh/id_ed25519_github
    User git
```

Now `ssh vps` instead of `ssh -i ~/.ssh/id_ed25519 deploy@<ip>`.

Local port forwarding (access a service on the VPS that isn't exposed publicly):
```bash
ssh -L 5432:localhost:5432 vps
# Now psql -h localhost -p 5432 connects to Postgres on the VPS
```

Remote port forwarding (expose a local service through the VPS; useful for demos):
```bash
ssh -R 8080:localhost:3000 vps
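# Port 8080 on the VPS now forwards to your local machine's port 3000.
# By default sshd binds it to the VPS loopback only; public access needs GatewayPorts in sshd_config plus a UFW rule.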
```

ProxyJump (hop through a bastion):
```bash
ssh -J bastion.example.com internal-server.example.com
# Or in config:
# Host internal
#     ProxyJump bastion
```

SSH agent:
```bash
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
ssh-add -l    # list loaded keys
```

---

**Friday–Saturday: Firewalls and network diagnostics**

UFW advanced:
```bash
ufw status numbered                     # numbered rules for deletion
ufw delete 3                            # delete rule 3
ufw allow from 10.0.0.0/8 to any port 5432  # Postgres from internal network only
ufw logging on
tail -f /var/log/ufw.log
```

iptables (you need to read it, not memorize it):
```bash
iptables -L -n -v           # list all rules with packet/byte counts
iptables -L INPUT -n -v     # just INPUT chain
iptables -t nat -L -n       # NAT table
```

UFW uses iptables underneath. When UFW allows port 80, it adds an iptables ACCEPT rule. This is how to answer "how does UFW work under the hood?"

nftables:
```bash
nft list ruleset    # view current rules
```

Network diagnostics:
```bash
ping -c 4 8.8.8.8                    # basic reachability
traceroute 8.8.8.8                   # hop-by-hop path
mtr 8.8.8.8                         # traceroute + ping combined, live
ss -s                                # socket statistics summary
ss -tnp state established            # established TCP connections
```

---

**Sunday: Build the networking diagnostics script**

This goes on GitHub as a real project.

`endpoint-monitor.py` (check a list of endpoints and report health):
```python
#!/usr/bin/env python3
"""
Endpoint health monitor: checks DNS, HTTP reachability, and SSL cert expiry.
Usage: python3 endpoint-monitor.py --config endpoints.yaml [--json]
Requires PyYAML (pip install pyyaml).
"""
import argparse, json, socket, ssl, sys, time
import urllib.error, urllib.parse, urllib.request
import yaml

def check_dns(hostname):
    try:
        ip = socket.gethostbyname(hostname)
        return {"status": "ok", "ip": ip}
    except socket.gaierror as e:
        return {"status": "error", "error": str(e)}

def check_http(url, timeout=10):
    start = time.monotonic()
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "endpoint-monitor/1.0"})
        with urllib.request.urlopen(req, timeout=timeout) as r:
            latency_ms = round((time.monotonic() - start) * 1000)
            return {"status": "ok", "http_code": r.status, "latency_ms": latency_ms}
    except urllib.error.HTTPError as e:
        # urlopen raises for 4xx/5xx; report the code rather than the exception text
        return {"status": "error", "http_code": e.code}
    except Exception as e:
        return {"status": "error", "error": str(e)}

def check_ssl(hostname, port=443):
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as s:
                cert = s.getpeercert()
        expiry_str = cert["notAfter"]
        # cert_time_to_seconds parses the certificate date format as UTC
        seconds_left = ssl.cert_time_to_seconds(expiry_str) - time.time()
        days_left = int(seconds_left // 86400)
        return {"status": "ok", "expires": expiry_str, "days_remaining": days_left}
    except Exception as e:
        return {"status": "error", "error": str(e)}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args()

    with open(args.config) as f:
        config = yaml.safe_load(f)

    results = {}
    for endpoint in config.get("endpoints", []):
        name = endpoint["name"]
        url = endpoint["url"]
        parsed = urllib.parse.urlparse(url)
        hostname = parsed.hostname   # strips scheme, port, and path
        results[name] = {
            "dns": check_dns(hostname),
            "http": check_http(url),
            "ssl": check_ssl(hostname, parsed.port or 443) if parsed.scheme == "https" else None,
        }

    if args.json:
        print(json.dumps(results, indent=2))
    else:
        for name, checks in results.items():
            print(f"\n{name}:")
            for check_name, result in checks.items():
                if result:
                    status = "✓" if result["status"] == "ok" else "✗"
                    print(f"  {status} {check_name}: {result}")

    # Non-zero exit lets cron jobs and CI steps detect a failed check
    any_failure = any(
        c["status"] == "error"
        for r in results.values()
        for c in r.values() if c
    )
    sys.exit(1 if any_failure else 0)

if __name__ == "__main__":
    main()
```

`endpoints.yaml`:
```yaml
endpoints:
  - name: FactorSphere
    url: https://factorsphere.org
  - name: FactorSphere API
    url: https://api.factorsphere.org
  - name: VPS
    url: http://<your-vps-ip>:8080
```

**Deliverables, Week 2:**
- `ops-scripts` repo updated: `endpoint-monitor.py` added with `endpoints.yaml.example` and updated README
- Can verbally answer in an interview: "Walk me through what happens when a user types factorsphere.org and hits Enter", from DNS query through TLS through Cloudflare edge to Worker response
- VPS subdomain configured (if you have a domain) and verified with `dig`

---

### Week 3 - Python DevOps Scripting

**What this week unlocks:** Python automation is the most common take-home task format in NCR DevOps interviews. By the end of this week you have a GitHub repo with real scripts and can complete a take-home assignment in 2 hours rather than 4.

**Study hours: 42**

**Learning objectives:**
- `subprocess`: running shell commands from Python, capturing output, handling return codes
- `os`/`sys`: environment variables, path operations, argument handling
- `argparse`: building proper CLI tools with flags and help text
- `requests`: HTTP calls, error handling, timeouts, retries
- `yaml`/`json`: config parsing, output generation
- Writing scripts that do real infrastructure work: deploy checks, API monitors, config validators, log parsers

---

**Monday–Tuesday: subprocess + os + sys + argparse**

`subprocess`, the right way:
```python
import subprocess

# Run a command, capture output, check return code
result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    capture_output=True,
    text=True,
    timeout=10
)
print(result.stdout.strip())   # "active" or "inactive"
print(result.returncode)       # 0 = active, 3 = inactive

# Run shell command (avoid when possible: harder to escape safely)
result = subprocess.run(
    "df -h | grep '/$'",
    shell=True, capture_output=True, text=True
)

# Raise on non-zero exit
subprocess.run(["docker", "ps"], check=True)  # raises CalledProcessError if docker fails
```

`os` and `sys`:
```python
import os, sys

# Environment variables
api_key = os.environ.get("CF_API_KEY")  # returns None if not set, no KeyError
api_key = os.environ["CF_API_KEY"]      # raises KeyError if not set; use when required

# Paths
os.path.join("/var/log", "nginx", "access.log")
os.path.exists("/etc/nginx/nginx.conf")
os.path.abspath("../config")

# Script directory (useful for loading config files relative to script)
script_dir = os.path.dirname(os.path.abspath(__file__))
config_path = os.path.join(script_dir, "config.yaml")

# Exit with status code (important for shell scripts that call your Python)
sys.exit(0)   # success
sys.exit(1)   # failure
```

`argparse` (build a real CLI):
```python
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Check service health on a VPS",
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("--host", required=True, help="VPS hostname or IP")
    parser.add_argument("--port", type=int, default=8080, help="Health endpoint port")
    parser.add_argument("--service", action="append", dest="services",
                        help="systemd service to check (repeat for multiple)")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--output-format", choices=["text", "json"], default="text")
    args = parser.parse_args()
    # args.host, args.port, args.services, args.verbose, args.output_format
```

Build `deploy-check.py`: it takes a service name as argument, checks it's active, verifies the port is listening, hits the health endpoint, and prints pass/fail with an exit code:
```python
#!/usr/bin/env python3
"""
Post-deploy smoke test: checks systemd service, port, and HTTP health endpoint.
Usage: python3 deploy-check.py --service nginx --port 80 --endpoint /health
"""
import argparse, subprocess, socket, sys
import requests

def check_service(name):
    r = subprocess.run(["systemctl", "is-active", name],
                       capture_output=True, text=True)
    return r.stdout.strip() == "active"

def check_port(port, host="localhost"):
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:  # covers timeouts and refused connections
        return False

def check_http(url, timeout=10):
    try:
        r = requests.get(url, timeout=timeout)
        # Treat 2xx/3xx as healthy; a 404 on the health path is a failed deploy
        return r.status_code < 400, f"status {r.status_code}"
    except requests.exceptions.RequestException as e:
        return False, str(e)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--service", required=True)
    parser.add_argument("--port", type=int, required=True)
    parser.add_argument("--endpoint", default="/health")
    parser.add_argument("--host", default="localhost")
    args = parser.parse_args()

    http_ok, http_detail = check_http(f"http://{args.host}:{args.port}{args.endpoint}")
    checks = [
        ("service active", check_service(args.service), ""),
        ("port listening", check_port(args.port, args.host), ""),
        (f"HTTP {args.endpoint}", http_ok, http_detail),
    ]

    passed = all(ok for _, ok, _ in checks)
    for name, ok, detail in checks:
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {name}" + (f" ({detail})" if detail else ""))

    sys.exit(0 if passed else 1)

if __name__ == "__main__":
    main()
```

---

**Wednesday: requests + API interaction**

```python
import os
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Basic
r = requests.get("https://api.example.com/status", timeout=10)
r.raise_for_status()   # raises HTTPError for 4xx/5xx
data = r.json()

# With headers
r = requests.get(
    "https://api.cloudflare.com/client/v4/accounts",
    headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
    timeout=10
)

# Retry logic
session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
r = session.get("https://api.example.com", timeout=10)
```
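To see what retry-with-backoff does without any HTTP library involved, here is a standard-library sketch. It is an illustration of the general idea only: the helper `call_with_retries` and its delay schedule (`backoff_factor * 2**attempt`) are assumptions for this example and are not claimed to match urllib3's exact timing.

```python
import time

def call_with_retries(func, attempts=3, backoff_factor=1.0, sleep=time.sleep):
    """Call func(); on an exception wait backoff_factor * 2**attempt, then retry.

    The sleep function is injectable so the schedule can be inspected in tests.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_factor * (2 ** attempt))

# Demo: a hypothetical call that fails twice, then succeeds.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

recorded = []
print(call_with_retries(flaky, sleep=recorded.append), recorded)
```

Growing the delay between attempts gives a struggling upstream time to recover instead of hammering it with immediate retries.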

Build `cloudflare-deploy-monitor.py`: it checks the status of the most recent FactorSphere Workers deployment via the Cloudflare API. Reads `CF_API_TOKEN` and `CF_ACCOUNT_ID` from environment variables. Outputs the deployment status and timestamp. Returns exit code 1 if the last deployment failed. This is a real script that interacts with a real production system.

Cloudflare API endpoint to use: `GET /accounts/{account_id}/workers/scripts/{script_name}/deployments` (check current CF API docs for the exact path; the concept is what matters here, and the implementation requires your actual credentials).

---

**Thursday: YAML/JSON parsing**

```python
import yaml, json

# YAML
with open("config.yaml") as f:
    config = yaml.safe_load(f)   # safe_load, not load: avoids arbitrary code execution

# JSON
with open("output.json") as f:
    data = json.load(f)

# Pretty print
print(json.dumps(data, indent=2, default=str))   # default=str handles datetime objects

# Write YAML
with open("generated-config.yaml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)
```

Build `config-validator.py`: it reads a YAML deployment config, validates that required keys exist and have the right types, prints errors and exits 1 if invalid, and exits 0 if valid:
```python
REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}
```

This is a real pattern: CI pipelines often validate config before proceeding.
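A minimal sketch of the validation core, assuming the config has already been loaded into a dict (in the real script that dict would come from `yaml.safe_load`). The function name `validate_config` and the sample config values are illustrative.

```python
REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}

def validate_config(config: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors

# Hypothetical config with two problems: port is a string, environment is missing.
sample = {"service_name": "api", "image": "myapi:latest",
          "port": "5000", "health_endpoint": "/health"}

for err in validate_config(sample):
    print(f"ERROR: {err}")
```

In the full script, `sys.exit(1 if errors else 0)` after printing gives CI a pass/fail signal.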

---

**Friday–Saturday: Full ops script**

Build `vps-health-reporter.py`. This is the Week 3 anchor deliverable: a single script that does everything:

```
Usage: python3 vps-health-reporter.py [--json] [--verbose] [--output FILE]

Checks:
  - Disk usage per filesystem (warns if > configurable threshold)
  - Memory usage
  - systemd services (configured list)
  - Port reachability (configured list)
  - HTTP endpoints (configured list with expected status codes)
  - SSL cert expiry for HTTPS endpoints (warns if < 30 days)

Output:
  - Text table by default
  - JSON with --json flag
  - Writes to file with --output flag
  - Exit code 0 if all checks pass, 1 if any fail

Config: reads from vps-health-reporter.yaml:

services:
  - nginx
  - healthapi
  - docker
ports:
  - host: localhost
    port: 80
    name: nginx-http
  - host: localhost
    port: 8080
    name: healthapi
endpoints:
  - url: http://localhost:8080
    expected_status: 200
    name: healthapi-root
  - url: https://factorsphere.org
    expected_status: 200
    name: factorsphere
disk_threshold_pct: 80
memory_threshold_pct: 90
ssl_warning_days: 30
```

This script uses `subprocess`, `requests`, `ssl`, `socket`, `yaml`, `json`, `argparse`, `sys`, `os`. It solves a real problem: paste it into any VPS and get a health report.
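As a starting point for the disk and memory checks, here is a standard-library sketch. The function names, the result dict shape, and the sample `/proc/meminfo` text are assumptions for illustration. Note that it computes memory use from `MemAvailable`, which can differ from the "used" column of `free`, and that `shutil.disk_usage` can report a slightly different percentage than `df`.

```python
import shutil

def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return round(usage.used / usage.total * 100, 1)

def mem_used_pct(meminfo_text: str) -> float:
    """Memory in use, computed from /proc/meminfo-style text (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    total = fields["MemTotal"]
    return round((total - fields["MemAvailable"]) / total * 100, 1)

def evaluate(name: str, value_pct: float, threshold_pct: float) -> dict:
    """Build one check result; status is 'fail' when the threshold is exceeded."""
    status = "fail" if value_pct > threshold_pct else "ok"
    return {"check": name, "value_pct": value_pct,
            "threshold_pct": threshold_pct, "status": status}

# Sample text standing in for open("/proc/meminfo").read() on a Linux host.
SAMPLE_MEMINFO = """\
MemTotal:        4000000 kB
MemFree:          500000 kB
MemAvailable:    1000000 kB
"""

print(evaluate("memory", mem_used_pct(SAMPLE_MEMINFO), 90))
print(evaluate("disk /", disk_used_pct("/"), 80))
```

Returning plain dicts keeps the `--json` output trivial: collect the results in a list and pass it to `json.dumps`.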

---

**Sunday: Polish and documentation**

`requirements.txt`:
```
requests==2.31.0
PyYAML==6.0.1
```

Meaningful commit history: if all your commits are `add scripts`, you're doing it wrong. Examples of correct commit messages:
```
feat(deploy-check): add HTTP health endpoint validation
fix(endpoint-monitor): handle SSL cert expiry for non-HTTPS endpoints gracefully
feat(vps-reporter): add configurable disk/memory thresholds from YAML config
refactor(cloudflare-monitor): extract API client to reusable class
```

README for the repo: what problem each script solves, prerequisites, installation (`pip install -r requirements.txt`), example usage for each script, example output.

**Deliverables, Week 3:**
- `ops-scripts` repo: 8 scripts (`health-check.sh`, `backup.sh`, `log-analyzer.sh`, `endpoint-monitor.py`, `deploy-check.py`, `cloudflare-deploy-monitor.py`, `config-validator.py`, `vps-health-reporter.py`), `requirements.txt`, `endpoints.yaml.example`, `vps-health-reporter.yaml.example`, clean README, 20+ meaningful commits

---

### Week 4 - Docker (Production) + Docker Compose + Application Strategy

**What this week unlocks:** Docker and Compose are tested in almost every junior DevOps interview in NCR. By the end of this week you have a real multi-service stack running on the VPS, something you can show and explain in a technical round. Sunday is the application strategy session.

**Study hours: 42**

**Learning objectives:**
- Multi-stage Dockerfiles: how and why, not just what
- Image optimization: layer caching order, `.dockerignore`, minimal base images
- Container networking: how containers resolve each other by name in Compose
- Volume management: named volumes vs bind mounts, when each is appropriate
- Health checks: HEALTHCHECK instruction, container health states, `depends_on` conditions
- Running Docker Compose stacks as persistent services on VPS
- Docker Compose: full multi-service stack, `.env` files, override files, Makefile

---

**Monday–Tuesday: Production Dockerfile patterns**

You've used Docker. These are the patterns that distinguish junior from intern-level usage.

Multi-stage build for a Python app:
```dockerfile
# syntax=docker/dockerfile:1

# --- Stage 1: builder ---
FROM python:3.11-slim AS builder

WORKDIR /build

# Copy only requirements first so Docker can cache this layer.
# If requirements.txt doesn't change, this layer is reused on rebuild
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# --- Stage 2: runtime ---
FROM python:3.11-slim AS runtime

# Create non-root user
RUN useradd --create-home --shell /bin/bash appuser

WORKDIR /app

# Copy only the installed packages from builder
COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .

USER appuser

# PATH must include user-installed packages
ENV PATH=/home/appuser/.local/bin:$PATH

EXPOSE 5000

HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1

# gunicorn must be listed in requirements.txt so the builder stage installs it
CMD ["python", "-m", "gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "app:app"]
```

`.dockerignore`:
```
.git
.gitignore
__pycache__
*.pyc
*.pyo
.pytest_cache
.env
*.env
node_modules
.venv
dist
build
*.egg-info
README.md
docs/
```

Why layer caching order matters: `COPY requirements.txt .` + `RUN pip install` before `COPY . .` means requirements are cached as long as `requirements.txt` doesn't change. If you `COPY . .` first, every file change invalidates the pip install layer. Run `docker build` twice β€” second run should show `CACHED` for the pip layer.

Compare image sizes:
```bash
docker images | grep myapp
# naive (python:3.11): ~1.1GB
# multi-stage (python:3.11-slim): ~200MB
# distroless: ~100MB
```

Container networking:
```bash
docker network create mynet
docker run -d --name db --network mynet postgres:15
docker run -it --network mynet alpine ping db   # resolves by container name
docker inspect mynet   # see connected containers and IP assignments
```

---

**Wednesday: Docker Compose**

Build a `docker-compose.yml` for a real multi-service stack:
```yaml
# No top-level "version" key: Compose v2 ignores it and warns that the attribute is obsolete

services:
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    image: myapi:latest
    container_name: myapi
    ports:
      - "5000:5000"
    environment:
      - DATABASE_URL=postgresql://postgres:${POSTGRES_PASSWORD}@db:5432/appdb
      - REDIS_URL=redis://cache:6379/0
      - SECRET_KEY=${SECRET_KEY}
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped
    networks:
      - backend

  db:
    image: postgres:15-alpine
    container_name: mydb
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - backend

  cache:
    image: redis:7-alpine
    container_name: myredis
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    restart: unless-stopped
    networks:
      - backend

volumes:
  postgres_data:
  redis_data:

networks:
  backend:
    driver: bridge
```

`.env` (never commit this; commit `.env.example` instead):
```
POSTGRES_PASSWORD=changeme_in_production
SECRET_KEY=changeme_in_production
```

Override for development: `docker-compose.override.yml`. Compose merges this file automatically whenever it sits next to `docker-compose.yml`, so it must be kept off the production host:
```yaml
services:
  api:
    volumes:
      - ./api:/app   # bind mount for hot reload in dev
    environment:
      - DEBUG=true
```

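Compose applies the override only when the file is present. If `docker-compose.override.yml` is committed to the repo, a `git pull` on the server brings it along, so production must either delete it or run `docker compose -f docker-compose.yml up -d` to load the base file alone. Otherwise the bind mount and `DEBUG=true` reach prod.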

---

**Thursday: Containers as systemd services + resource limits**

Running Docker Compose as a systemd service on the VPS:

`/etc/systemd/system/myapp.service`:
```ini
[Unit]
Description=MyApp Docker Compose Stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=120
TimeoutStopSec=30
User=deploy

[Install]
WantedBy=multi-user.target
```

Resource limits in Compose:
```yaml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
        reservations:
          memory: 128M
```

Log management (prevents containers from filling your disk):
```yaml
services:
  api:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```
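The rotation settings above put a ceiling on log disk usage of roughly `max-size` times `max-file` per container, while a service with no `logging` block can grow without limit under the default `json-file` driver. The sketch below works that out from a Compose-like dictionary. It assumes sizes are written in megabytes (`"10m"`), and the `EXAMPLE_SERVICES` data and function names are hypothetical.

```python
def log_ceiling_mb(service):
    """Return the rotated-log ceiling in MB for one service, or None if unbounded."""
    options = service.get("logging", {}).get("options", {})
    if "max-size" not in options:
        return None  # no max-size means the log file is never rotated
    size = options["max-size"].lower()
    if not size.endswith("m"):
        raise ValueError("this sketch only handles megabyte sizes such as '10m'")
    # max-file defaults to 1 when only max-size is given
    return int(size[:-1]) * int(options.get("max-file", "1"))


def log_report(services):
    """Map each service name to its ceiling in MB (None means unbounded)."""
    return {name: log_ceiling_mb(cfg) for name, cfg in services.items()}


# Hypothetical stack: only the api service carries the logging block shown above.
EXAMPLE_SERVICES = {
    "api": {"logging": {"driver": "json-file",
                        "options": {"max-size": "10m", "max-file": "3"}}},
    "db": {},
    "cache": {},
}
```

For this example the report shows `api` capped near 30 MB while `db` and `cache` are unbounded, which is a reminder to repeat the `logging` block on every service.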

---

**Friday–Saturday: Build the anchor project (vps-multiservice)**

This is the Week 4 portfolio project. Build it properly.

**Project: Multi-service API stack on VPS**

Repository: `vps-multiservice`

Structure:
```
vps-multiservice/
├── api/
│   ├── app.py
│   ├── Dockerfile
│   └── requirements.txt
├── db/
│   └── init.sql
├── nginx/                    # placeholder config, replaced in Week 5
│   └── default.conf
├── docker-compose.yml
├── docker-compose.override.yml
├── .env.example
├── Makefile
└── README.md
```

`api/app.py` (a real Flask API, not hello-world):
```python
from flask import Flask, jsonify
import psycopg2, redis, os, time

app = Flask(__name__)

def get_db():
    return psycopg2.connect(os.environ["DATABASE_URL"])

def get_redis():
    url = os.environ.get("REDIS_URL", "redis://cache:6379/0")
    return redis.from_url(url)

@app.route("/health")
def health():
    checks = {}
    # Check Postgres
    try:
        conn = get_db()
        conn.close()
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
    # Check Redis
    try:
        r = get_redis()
        r.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {e}"

    all_ok = all(v == "ok" for v in checks.values())
    return jsonify({"status": "ok" if all_ok else "degraded", "checks": checks}), \
           200 if all_ok else 503

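START_TIME = time.time()  # captured once at import so /info can report real uptime

@app.route("/info")
def info():
    return jsonify({
        "service": "vps-multiservice-api",
        "version": os.environ.get("APP_VERSION", "dev"),
        # seconds since this process started; time.time() alone is just the current epoch time
        "uptime": round(time.time() - START_TIME, 1)
    })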

@app.route("/cache/set/<key>/<value>")
def cache_set(key, value):
    r = get_redis()
    r.setex(key, 300, value)
    return jsonify({"stored": key})

@app.route("/cache/get/<key>")
def cache_get(key):
    r = get_redis()
    value = r.get(key)
    return jsonify({"key": key, "value": value.decode() if value else None})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

`Makefile`:
```makefile
.PHONY: up down logs ps build restart deploy health shell-api shell-db

up:
	docker compose up -d --build

down:
	docker compose down

logs:
	docker compose logs -f

ps:
	docker compose ps

build:
	docker compose build --no-cache

restart:
	docker compose restart

health:
	curl -s http://localhost:5000/health | python3 -m json.tool

shell-api:
	docker compose exec api /bin/bash

shell-db:
	docker compose exec db psql -U postgres -d appdb
```

`README.md` must answer: what this project is, what services it runs, how to start it locally, how to deploy to production, what the health endpoint checks, what environment variables are required, what the architecture looks like (diagram), and what decisions were made (why named volumes over bind mounts, why `service_healthy` condition on `depends_on`, why non-root user in Dockerfile).

Deploy to the Hetzner VPS: clone the repo, create `.env` from `.env.example`, `make up`, verify `make health` returns 200. Configure the systemd service so it starts on boot.

---

**Sunday: Application strategy**

This is the only Sunday in Phase 1 not dedicated to pure technical work. Block the full day.

**Resume (one page, PDF):**

Header: your name, email, GitHub URL, LinkedIn URL, location (Delhi NCR), phone.

Title: *Junior DevOps & Infrastructure Engineer*

Experience section:
```
Software Developer Intern, [Antler-backed venture studio]  [dates]
• Shipped production SaaS MVPs across multiple projects; owned Docker-based
  deployment pipelines and CI/CD configuration for 3+ products
• Implemented GitHub Actions workflows for automated testing and deployment
• Worked across time zones with distributed team
```

Projects section (this is more important than the internship for a DevOps role):
```
FactorSphere (factorsphere.org): Production Edge Platform
• Live academic journal ranking platform with 4,000+ journals, real users,
  Google-indexed; won college project exhibition
• Serverless microservices on Cloudflare Workers (analogous to AWS Lambda +
  API Gateway), Pinecone vector database, LLM inference pipeline
• CI/CD: GitHub Actions → Wrangler → Cloudflare edge deployment
• Architecture: fully edge-native; documented in ARCHITECTURE.md on GitHub

VPS Multi-Service Stack (github.com/you/vps-multiservice)
• Multi-service API stack on Ubuntu VPS: Flask + Redis + PostgreSQL
• Multi-stage Docker builds, Docker Compose, named volumes, health checks
• systemd service management, UFW firewall configuration, SSH hardening
```

Skills section:
```
Infrastructure: Linux (Ubuntu/Arch), Docker, Docker Compose, systemd, UFW, SSH
Scripting: Python (subprocess, requests, argparse, YAML/JSON), Bash
CI/CD: GitHub Actions, Wrangler (Cloudflare)
Networking: DNS, HTTP/S, TCP/IP, SSL/TLS, reverse proxy concepts
Observability: log analysis (journalctl), endpoint monitoring
Platforms: Cloudflare Workers/Pages, Pinecone, Hetzner VPS
Version Control: Git (branching, rebasing, structured commits)
```

Do not list technologies you can't defend in a 5-minute conversation. If Kubernetes is not on your CV, a recruiter won't ask about it. If it is, they will.

**LinkedIn:**
- Headline: *Junior DevOps Engineer | Cloudflare Workers | Docker | Python | Delhi NCR*
- About section: 3 sentences. "CS grad with production experience at an Antler-backed venture studio. Built and deployed FactorSphere (factorsphere.org), a live platform running on a serverless edge architecture. Targeting junior DevOps and cloud infrastructure roles in Delhi NCR."
- Add all projects with links
- Enable "Open to Work" (visible to recruiters, not your network if you prefer)

**Job titles to search:**
- "DevOps Engineer" + fresher/junior/0-2 years
- "Cloud Support Engineer"
- "Infrastructure Engineer"
- "Site Reliability Engineer" (rare at fresher level but exists)
- "Systems Engineer" (often infrastructure work at service companies)

**Platforms, in priority order:**
1. LinkedIn: set job alerts for each title + Delhi, Gurgaon, Noida
2. Naukri.com: mandatory for NCR; service companies and GCCs post exclusively here
3. Instahyre: funded product companies
4. Wellfound (formerly AngelList Talent): funded startups
5. Company career pages directly: Nagarro, GlobalLogic, Publicis Sapient, Genpact, Coforge (formerly NIIT Technologies), Mphasis

**Outreach message template (LinkedIn, max 5 lines):**
> Hi [Name], I'm a CS grad with production experience shipping SaaS MVPs at an Antler-backed venture studio, including a live platform (factorsphere.org) running on a serverless edge architecture with Docker, CI/CD, and Python scripting across projects. I'm targeting junior DevOps roles in NCR and noticed [Company] works on [relevant tech or cloud platform from their JD]. Would it be appropriate to share my CV directly?

Send this to: DevOps leads, engineering managers, or HR at target companies. Not recruiters at staffing agencies (waste of time for this profile). Target 10 outreach messages in the first week of applying.

**By Sunday evening, completed:**
- Resume PDF finalized
- LinkedIn updated
- Naukri profile created with correct resume
- 5 job applications submitted
- 5 outreach messages sent on LinkedIn

**Deliverables, Week 4:**
- `vps-multiservice` repo: running on VPS, full README, Makefile, `.env.example`, meaningful commit history showing incremental development
- `ops-scripts` repo: polished with all Week 1–3 scripts, clean README
- `endpoint-monitor.py` and `vps-health-reporter.py` running and documented
- FactorSphere: `docs/ARCHITECTURE.md` and `docs/CICD.md` committed
- Resume PDF (one page)
- LinkedIn updated
- 5+ applications submitted, 5+ outreach messages sent
- Job alert set on LinkedIn and Naukri

---

## PHASE 2: Weeks 5–12 - Become Hireable by Stronger Companies

---

### Week 5 - Nginx

**What this week unlocks:** Reverse proxy knowledge is expected at junior level. SSL termination and upstream configuration come up in almost every DevOps technical round. You also get HTTPS on your VPS project, a visible signal of operational maturity.

**Study hours: 42**

**Learning objectives:** Virtual hosts, reverse proxy, SSL termination with Let's Encrypt, load balancing upstream blocks, rate limiting, security headers, static file serving

---

**Monday: Installation and configuration structure**

```bash
apt install nginx
systemctl enable nginx
systemctl start nginx
nginx -v
```

Configuration structure on Ubuntu:
- `/etc/nginx/nginx.conf`: main config, defines `worker_processes`, `events`, and the `http` block
- `/etc/nginx/sites-available/`: your server blocks live here
- `/etc/nginx/sites-enabled/`: symlinks to sites-available for active configs
- `nginx -t`: test config syntax; always run before `systemctl reload nginx`
- `systemctl reload nginx`: graceful reload (no dropped connections) vs `restart` (all connections dropped)

Read `/etc/nginx/nginx.conf`. Understand: `worker_processes auto` uses all cores, `worker_connections 1024` limits connections per worker, the `include /etc/nginx/sites-enabled/*` line.

---

**Tuesday–Wednesday: Reverse proxy + SSL**

Create `/etc/nginx/sites-available/myapp`:
```nginx
server {
    listen 80;
    server_name vps.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 30s;
        proxy_read_timeout 30s;
        proxy_buffering on;
    }

    location /health {
        proxy_pass http://127.0.0.1:5000/health;
        access_log off;   # don't pollute logs with health checks
    }
}
```

```bash
ln -s /etc/nginx/sites-available/myapp /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx
```
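The `X-Forwarded-For` header set in the proxy block only helps if the application reads it carefully. Because `$proxy_add_x_forwarded_for` appends the connecting address to whatever the client sent, only the entries added by your own proxy are trustworthy. The sketch below picks the client address by walking the header from the right. The function name `client_ip` and the default trusted proxy list are assumptions. With the API in a container, the peer Flask sees is usually the Docker bridge gateway, not `127.0.0.1`, so adjust the list for your setup.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for, trusted_proxies=("127.0.0.1",)):
    """Return the most trustworthy client address for a proxied request.

    remote_addr: address of the socket peer (the proxy, if there is one).
    forwarded_for: raw X-Forwarded-For header value, or None.
    trusted_proxies: addresses of proxies you operate yourself.
    """
    trusted = {ipaddress.ip_address(p) for p in trusted_proxies}
    peer = ipaddress.ip_address(remote_addr)
    if peer not in trusted or not forwarded_for:
        # Direct connection, or nothing forwarded: the header cannot be trusted.
        return str(peer)
    hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            break  # malformed entry: stop and fall back to the peer address
        if addr not in trusted:
            return str(addr)  # first hop from the right that is not our proxy
    return str(peer)
```

Reading from the right means a client that sends its own forged `X-Forwarded-For: 1.2.3.4` still gets logged under the address Nginx observed.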

SSL with Certbot:
```bash
apt install certbot python3-certbot-nginx
certbot --nginx -d vps.yourdomain.com
# Follow prompts: enter email, agree to TOS, choose redirect HTTP→HTTPS
certbot renew --dry-run   # verify auto-renewal works
```

After Certbot runs, inspect `/etc/nginx/sites-available/myapp`, which Certbot modifies. The config now has a `listen 443 ssl` block and an HTTP-to-HTTPS redirect. Read and understand what was added.

Auto-renewal is handled by a systemd timer installed by Certbot: `systemctl status certbot.timer`.

---

**Thursday: Multiple virtual hosts + static files**

Three server blocks:
1. `docs.yourdomain.com` → serves static HTML files from `/var/www/docs/`
2. `api.yourdomain.com` → reverse proxy to Flask API on port 5000
3. Default catch-all → returns 444 (nginx closes connection without response)

Static file serving:
```nginx
server {
    listen 443 ssl;
    server_name docs.yourdomain.com;
    # (SSL certs added by certbot)

    root /var/www/docs;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    # Cache static assets
    location ~* \.(css|js|png|jpg|ico)$ {
        expires 30d;
        add_header Cache-Control "public, immutable";
    }
}
```

Default catch-all:
```nginx
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;
    ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
    ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
    return 444;
}
```

---

**Friday–Saturday: Load balancing + rate limiting + security headers**

Upstream block for load balancing:
```nginx
upstream api_backends {
    least_conn;   # route to server with fewest active connections
    server 127.0.0.1:5000 weight=2;
    server 127.0.0.1:5001 weight=1;   # start a second Flask instance for this exercise
    keepalive 32;
}

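server {
    location / {
        proxy_pass http://api_backends;
        # upstream keepalive only takes effect with HTTP/1.1 and a cleared Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}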
```

Rate limiting:
```nginx
# In http block (nginx.conf or a conf.d include):
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=health_limit:1m rate=60r/m;

# In server block:
location /api/ {
    limit_req zone=api_limit burst=20 nodelay;
    proxy_pass http://api_backends;
}
```
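To build intuition for `rate=10r/s` with `burst=20 nodelay`, the accounting can be simulated. This is a simplified single-client model based on how `limit_req` tracks an "excess" counter. It works in seconds and ignores the shared-memory zone and per-key lookup. The class name `LimitReq` is hypothetical.

```python
class LimitReq:
    """Simplified model of limit_req accounting for one client key."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.excess = 0.0   # requests currently counted above the steady rate
        self.last = None    # time of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:
            self.last = now
            return True  # first request from this key
        # Drain the counter at the configured rate, then add this request.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1)
        if excess > self.burst:
            return False  # rejected (nginx answers 503 unless limit_req_status is set)
        self.excess = excess
        self.last = now
        return True
```

In this model a client that fires 30 requests at the same instant gets 21 through (one plus the burst of 20) and the rest rejected; a second later, 10 more slots have drained and become available again.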

Security headers:
```nginx
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
```

---

**Sunday: Update vps-multiservice**

Replace the placeholder Nginx container in Docker Compose with a reference to the host Nginx. Update `README.md`:
- Architecture diagram showing: internet → Nginx (host, SSL) → Docker network → Flask API container
- SSL configuration documented
- How to reproduce (Certbot commands, nginx site config)

Update the `vps-multiservice` README with a "Production Deployment" section showing the full stack.

---

### Week 6 - GitHub Actions for VPS

**What this week unlocks:** CI/CD for VPS-hosted projects is the most visible resume signal. This pipeline demonstrates that your Docker Compose stack is managed like production infrastructure, not run manually. Combined with FactorSphere's existing CI/CD, you can now speak to two different CI/CD patterns.

**Study hours: 42**

---

**Monday–Tuesday: GitHub Actions syntax**

Create `.github/workflows/deploy.yml` in `vps-multiservice`:
```yaml
name: Deploy to VPS

on:
  push:
    branches: [main]
  workflow_dispatch:   # allow manual trigger from GitHub UI

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate docker-compose
        run: docker compose config

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to VPS
        uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.VPS_HOST }}
          username: deploy
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            cd /opt/myapp
            git pull origin main
            docker compose pull
            docker compose up -d --build --remove-orphans
            docker system prune -f

      - name: Smoke test
        run: |
          sleep 10
          curl -f https://vps.yourdomain.com/health || exit 1
```
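A fixed `sleep 10` followed by one `curl` can fail on a slow start or waste time on a fast one. One alternative is a small polling script that retries until `/health` returns 200. The sketch below is one way to write it with the standard library. The function names, retry counts, and the injectable `fetch` and `sleep` parameters are assumptions made for testability, not part of the workflow above.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with 4xx/5xx
    except (urllib.error.URLError, OSError):
        return None      # connection refused, DNS failure, or timeout


def wait_until_healthy(url, attempts=10, delay=3, fetch=http_status, sleep=time.sleep):
    """Poll url until it returns 200.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

In a workflow step, a wrapper would call `wait_until_healthy("https://vps.yourdomain.com/health")` and exit non-zero when it returns `None`, so the job fails only after the retries are exhausted.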

GitHub Secrets to configure (Settings → Secrets and variables → Actions):
- `VPS_HOST`: your VPS IP
- `VPS_SSH_KEY`: contents of a dedicated deploy private key (generate a new `ed25519` key specifically for GitHub Actions, add the public key to `~/.ssh/authorized_keys` on the VPS)

Never put the private key in the repo or hardcode it in the workflow.

---

**Wednesday–Thursday: Zero-downtime consideration + notifications**

The naive `docker compose up -d --build` restarts containers, causing brief downtime. For a portfolio project this is acceptable. Document this limitation in the README and explain what a production mitigation looks like (blue-green deployment, rolling update in Kubernetes, or a health-check grace period with a load balancer).

Add failure notification via GitHub's built-in email (no setup required; by default GitHub emails you when a workflow fails).

Add a status badge to the README:
```markdown
![Deploy](https://github.com/yourusername/vps-multiservice/actions/workflows/deploy.yml/badge.svg)
```

---

**Friday–Saturday: Makefile deployment tooling**

Add to the `Makefile`:
```makefile
deploy:
	@echo "Deploying to VPS..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git pull && docker compose up -d --build"

# --build rebuilds the image from the older source. The server is left on a
# detached HEAD, so run `git checkout main` there before the next deploy.
rollback:
	@echo "Rolling back to previous commit..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git checkout HEAD~1 && docker compose up -d --build"

status:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose ps && curl -s http://localhost:5000/health"

logs-prod:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose logs -f --tail=100"
```

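Usage: `make VPS_HOST=<ip> deploy`. To avoid typing the host each time, put `VPS_HOST=<ip>` in a local `.env.make` (not committed) and add `-include .env.make` near the top of the Makefile; make does not read that file on its own.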

---

**Sunday: FactorSphere CI/CD comparison**

Update `docs/CICD.md` in FactorSphere to add a comparison section:

```
## CI/CD Pattern Comparison

FactorSphere (Cloudflare Workers):
  Trigger: push to main
  Pipeline: GitHub Actions → wrangler deploy → Cloudflare edge network
  Rollback: wrangler rollback <deployment-id> or git revert + push
  State: stateless Workers, no server to manage

vps-multiservice (Docker Compose + SSH):
  Trigger: push to main
  Pipeline: GitHub Actions → SSH → git pull → docker compose up
  Rollback: git checkout HEAD~1 + docker compose up (or tag-based rollback)
  State: stateful services (Postgres data in named volume), must manage carefully
```

Being able to explain this comparison (why each pattern exists, what tradeoffs each makes) is a strong interview signal.

---

### Weeks 7–9 - AWS (EC2, IAM, S3, VPC, CloudWatch)

Three weeks of AWS. You are building toward a single coherent project: the same multi-service stack, now running on EC2, with IAM roles, S3 backups, VPC networking, and CloudWatch logging.

---

### Week 7 - EC2 + IAM

**What this week unlocks:** AWS is on your CV with real hands-on evidence. Most NCR product companies and GCCs require at least basic AWS. After this week you can pass a cloud support phone screen.

**Learning objectives:** EC2 launch and management, SSH key pairs, security groups, IAM users/roles/policies, principle of least privilege, AWS CLI, instance profiles

---

**Monday–Tuesday: AWS account + IAM baseline**

Create an AWS account (personal email, not institutional). Immediately:
1. Enable MFA on the root account
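2. Create a billing alarm: enable billing alerts in the Billing preferences, switch the console region to us-east-1 (billing metrics are only published there), then go to CloudWatch → Alarms → Create Alarm → Billing → Total Estimated Charge → threshold $5 → email notification. This protects against accidentally leaving expensive resources running.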
3. Create an IAM user for yourself: `yourname-admin`, `AdministratorAccess` policy, programmatic + console access, enable MFA
4. Never use root again

Configure AWS CLI:
```bash
aws configure
# AWS Access Key ID: [from IAM user]
# AWS Secret Access Key: [from IAM user]
# Default region: ap-south-1   (Mumbai, lowest latency from Delhi)
# Default output format: json

aws sts get-caller-identity   # verify who you're authenticated as
aws iam get-user              # verify correct user
```

For IAM, understand these three things cold:
- **User**: a person or application with long-term credentials (access keys)
- **Role**: an identity assumed temporarily; no long-term credentials; assumed by EC2 instances, Lambda, other services
- **Policy**: a JSON document that defines what actions are allowed on which resources

Create a principle of least privilege policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-specific-bucket",
        "arn:aws:s3:::my-specific-bucket/*"
      ]
    }
  ]
}
```

This is more secure than `"Action": "s3:*"` and more secure than `"Resource": "*"`. Be ready to explain why.
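One way to rehearse the "why" is to evaluate requests against both versions of the policy. The sketch below is a deliberately simplified evaluator: it handles only `Allow` statements with `*` wildcards and ignores explicit denies, conditions, and the rule that each action applies to a specific resource type. `is_allowed`, `SCOPED`, and `WILDCARD` are hypothetical names, not an AWS API.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM accepts a single string or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Return True if any Allow statement matches both the action and the resource."""
    for stmt in _as_list(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        # Action names are case-insensitive; resource ARNs are matched as written.
        action_ok = any(fnmatchcase(action.lower(), a.lower())
                        for a in _as_list(stmt["Action"]))
        resource_ok = any(fnmatchcase(resource, r)
                          for r in _as_list(stmt["Resource"]))
        if action_ok and resource_ok:
            return True
    return False  # implicit deny: nothing allowed it


SCOPED = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::my-specific-bucket",
                 "arn:aws:s3:::my-specific-bucket/*"],
}]}

WILDCARD = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow", "Action": "s3:*", "Resource": "*",
}]}
```

Under the scoped policy a leaked credential can read one bucket and nothing else. Under the wildcard policy the same credential can delete objects in every bucket in the account.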

---

**Wednesday–Thursday: EC2 launch and management**

Launch a `t2.micro` (free tier) with Ubuntu 22.04 in ap-south-1:
- AMI: Ubuntu 22.04 LTS (search "ubuntu 22.04 hvm" in Community AMIs or use the Quick Start)
- Instance type: t2.micro (free tier eligible)
- Key pair: create a new key pair named `ec2-deploy-key`, download the `.pem`
- Security group: create `web-sg`
  - Inbound: SSH (22) from your IP only (not 0.0.0.0/0), HTTP (80) from 0.0.0.0/0, HTTPS (443) from 0.0.0.0/0
  - Outbound: all traffic (default)
- Storage: 8GB gp3 (free tier)

SSH in:
```bash
chmod 400 ec2-deploy-key.pem
ssh -i ec2-deploy-key.pem ubuntu@<ec2-public-ip>
```

Install Docker on the EC2:
```bash
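sudo apt update
# docker-compose-plugin ships only in Docker's own apt repo; on stock Ubuntu 22.04
# the Compose v2 package is docker-compose-v2 (assumes jammy-updates is enabled)
sudo apt install -y docker.io docker-compose-v2
sudo usermod -aG docker ubuntu   # log out and back in for the group change to apply
sudo systemctl enable docker
sudo systemctl start docker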
```

User data bootstrap: relaunch with a user data script that installs Docker automatically. In the EC2 launch wizard, under "Advanced Details" → "User data":
```bash
#!/bin/bash
apt-get update -y
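# docker-compose-v2 is the Compose package in Ubuntu's own repos (docker-compose-plugin needs Docker's apt repo)
apt-get install -y docker.io docker-compose-v2 git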
usermod -aG docker ubuntu
systemctl enable docker
systemctl start docker
```

This is how you automate instance bootstrap, an important concept for EC2.

---

**Friday–Saturday: IAM role + instance profile**

Create an IAM role for EC2 to access S3:
1. IAM → Roles → Create role
2. Trusted entity: AWS service → EC2
3. Policy: create inline policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::your-backup-bucket",
      "arn:aws:s3:::your-backup-bucket/*"
    ]
  }]
}
```
4. Name the role `ec2-app-role`
5. Attach the role to your EC2 instance: EC2 → Actions → Security → Modify IAM role

Now on the EC2, without any credentials:
```bash
aws s3 ls s3://your-backup-bucket   # works because instance has IAM role
aws sts get-caller-identity          # shows the assumed-role identity
```

This is the correct way to give EC2 access to AWS services, rather than putting access keys on the instance. Be ready to explain why: if the instance is compromised, long-term access keys stored on it can be exfiltrated and reused until someone rotates them; role credentials are temporary, rotated automatically, and scoped to the role's policy.

Deploy `vps-multiservice` to EC2: clone the repo, create `.env`, `docker compose up -d`. Verify `curl http://<ec2-ip>:5000/health` returns 200.

---

### Week 8 - S3 + VPC

**Learning objectives:** S3 bucket operations, IAM policies for S3, static hosting, lifecycle policies, presigned URLs, boto3; VPC fundamentals (subnets, route tables, internet gateway, security groups vs NACLs)

---

**Monday–Tuesday: S3**

```bash
# Create bucket (names are globally unique, so add a suffix and reuse the variable)
BUCKET="your-backup-bucket-$(date +%s)"
aws s3 mb "s3://${BUCKET}" --region ap-south-1

# Upload and download
aws s3 cp backup.tar.gz "s3://${BUCKET}/backups/backup.tar.gz"
aws s3 sync ./logs/ "s3://${BUCKET}/logs/"
aws s3 ls "s3://${BUCKET}/" --recursive

# Static website hosting
aws s3 mb s3://your-docs-site
aws s3 website s3://your-docs-site --index-document index.html --error-document 404.html
# New buckets have ACLs disabled and Block Public Access on, so --acl public-read is
# rejected; turn off Block Public Access for this bucket and add a public-read bucket policy
aws s3 sync ./docs/ s3://your-docs-site

Lifecycle policy (JSON):
```json
{
  "Rules": [{
    "ID": "archive-old-backups",
    "Status": "Enabled",
    "Filter": {"Prefix": "backups/"},
    "Transitions": [{
      "Days": 30,
      "StorageClass": "GLACIER"
    }],
    "Expiration": {"Days": 365}
  }]
}
```

```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-backup-bucket \
  --lifecycle-configuration file://lifecycle.json
```
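To check your reading of the lifecycle rule, it helps to state it as a function of object key and age. This is a sketch of the rule's intent only. S3 evaluates lifecycle rules asynchronously, so real transitions lag behind these thresholds. The function name `lifecycle_state` is hypothetical.

```python
def lifecycle_state(key, age_days, prefix="backups/",
                    transition_days=30, expiration_days=365):
    """Return the state the rule above aims for, given an object's key and age."""
    if not key.startswith(prefix):
        return "STANDARD"   # the rule's Filter does not match this object
    if age_days >= expiration_days:
        return "EXPIRED"    # deleted by the Expiration action
    if age_days >= transition_days:
        return "GLACIER"    # moved by the Transitions action
    return "STANDARD"
```

Objects outside `backups/`, such as the synced logs, are untouched by this rule and stay in standard storage indefinitely unless another rule covers them.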

Presigned URLs with Python (boto3):
```python
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")

# Generate URL that expires in 1 hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "your-backup-bucket", "Key": "backups/backup.tar.gz"},
    ExpiresIn=3600
)
print(url)  # anyone with this URL can download for 1 hour
```

Update `backup.sh` to upload to S3 after creating the local archive:
```bash
# At end of backup.sh:
if command -v aws &>/dev/null; then
    aws s3 cp "$ARCHIVE" "s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")" && \
        echo "Uploaded to S3: s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")"
fi
```

---

**Wednesday–Thursday: VPC**

The default VPC works for most things. Understanding custom VPCs is the interview signal.

Create a custom VPC:
1. VPC → Create VPC
   - Name: `app-vpc`
   - CIDR: `10.0.0.0/16`
2. Create subnets:
   - Public subnet: `10.0.1.0/24`, AZ: ap-south-1a
   - Private subnet: `10.0.2.0/24`, AZ: ap-south-1b
3. Create internet gateway, attach to `app-vpc`
4. Route table for public subnet: add route `0.0.0.0/0 → internet gateway`
5. Route table for private subnet: no internet route (private)
6. Launch EC2 in the public subnet with auto-assign public IP enabled (it is off by default in a custom subnet); the instance then has a public IP and internet access
7. Understand: an EC2 in private subnet has no internet unless you add a NAT gateway (costs roughly $0.045/hour plus data processing; don't provision, just understand)

The interview question version: "What's the difference between a public and private subnet in AWS?" Answer: a public subnet has a route to an internet gateway; a private subnet does not. Resources in a private subnet can reach the internet via a NAT gateway but cannot be reached from the internet directly.
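The public/private distinction comes down to what the subnet's route table contains. The sketch below models the two route tables from the steps above and does a longest-prefix lookup with the standard `ipaddress` module. The table contents mirror this exercise; the names `PUBLIC_ROUTES`, `PRIVATE_ROUTES`, and `next_hop` are made up for illustration.

```python
import ipaddress

# Every VPC route table has an implicit "local" route for the VPC CIDR.
PUBLIC_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw"}
PRIVATE_ROUTES = {"10.0.0.0/16": "local"}


def next_hop(routes, destination):
    """Return the target of the most specific matching route, or None if unroutable."""
    dest = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(cidr) for cidr in routes
               if dest in ipaddress.ip_network(cidr)]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return routes[str(best)]
```

Traffic to `8.8.8.8` resolves to the internet gateway from the public subnet and to nothing from the private one, while traffic between the two subnets uses the local route in both cases.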

---

**Friday–Saturday: Integrate S3 into the CI/CD pipeline**

Update `.github/workflows/deploy.yml` in `vps-multiservice`:
```yaml
- name: Archive previous version to S3
  run: |
    aws s3 cp docker-compose.yml \
      s3://your-backup-bucket/deployments/$(date +%Y%m%d_%H%M%S)/docker-compose.yml
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_DEFAULT_REGION: ap-south-1
```

This demonstrates integration of S3 into a CI/CD pipeline. The backup is a real operational artifact: it means you can reconstruct what was deployed at any given time.

---

### Week 9 - CloudWatch

**Learning objectives:** CloudWatch Logs (log groups, log streams, log agents), metric filters, alarms, custom metrics via boto3

---

**Monday–Tuesday: CloudWatch Logs**

Install CloudWatch agent on EC2:
```bash
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb   # sudo needed when logged in as the ubuntu user
```

Configure `/opt/aws/amazon-cloudwatch-agent/bin/config.json`:
```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "vps-multiservice",
            "log_stream_name": "nginx-access"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "vps-multiservice",
            "log_stream_name": "nginx-error"
          }
        ]
      }
    }
  }
}
```

```bash
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
```

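The EC2 IAM role needs CloudWatch permissions. Add a statement scoped to what the agent uses rather than `logs:*` (AWS also provides the managed `CloudWatchAgentServerPolicy` for this):
```json
{
  "Effect": "Allow",
  "Action": [
    "cloudwatch:PutMetricData",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:DescribeLogStreams"
  ],
  "Resource": "*"
}
```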

---

**Wednesday–Thursday: Metric filters + alarms**

In the AWS console:
1. CloudWatch → Log groups → `vps-multiservice` (metric filters attach to the log group, which contains the `nginx-access` stream)
2. Create metric filter: pattern `[ip, id, user, timestamp, request, status_code=5*, ...]`
3. Metric name: `nginx-5xx`, metric value: 1
4. Create alarm on this metric: if sum > 5 in 5 minutes → SNS notification to your email
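The filter-plus-alarm logic above can be reproduced locally to sanity-check thresholds against a real log before wiring up SNS. This sketch assumes Nginx's default combined log format and fixed 5-minute buckets, which approximates but does not exactly match how CloudWatch aligns periods. The names `parse_line` and `alarm_buckets` are hypothetical.

```python
import re
from datetime import datetime

# remote_addr, ident, user, [time_local], "request", status
LINE = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) ')


def parse_line(line):
    """Return (timestamp, status) for a combined-format access log line, else None."""
    match = LINE.match(line)
    if not match:
        return None
    timestamp = datetime.strptime(match.group(4), "%d/%b/%Y:%H:%M:%S %z")
    return timestamp, int(match.group(6))


def alarm_buckets(lines, threshold=5, period_minutes=5):
    """Return the start times of periods whose 5xx count exceeds the threshold."""
    counts = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not in the expected format
        timestamp, status = parsed
        if 500 <= status <= 599:
            bucket = timestamp.replace(
                minute=timestamp.minute - timestamp.minute % period_minutes,
                second=0, microsecond=0)
            counts[bucket] = counts.get(bucket, 0) + 1
    return sorted(b for b, n in counts.items() if n > threshold)
```

Feeding it the lines of `/var/log/nginx/access.log` shows which past windows would have fired the alarm, which helps decide whether "more than 5 in 5 minutes" is too noisy or too quiet for your traffic.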

Billing alarm (if not done in Week 7):
```bash
# Billing metrics exist only in us-east-1, so override the default region;
# the SNS topic must be in us-east-1 as well.
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name "billing-alert-5usd" \
    --alarm-description 'Alert when charges exceed $5' \
    --metric-name EstimatedCharges \
    --namespace AWS/Billing \
    --statistic Maximum \
    --period 86400 \
    --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=Currency,Value=USD \
    --evaluation-periods 1 \
    --alarm-actions <your-sns-topic-arn>
```

---

**Friday–Saturday: Custom metrics + Python integration**

Add a `--push-to-cloudwatch` flag to `vps-health-reporter.py`:
```python
import boto3

def push_to_cloudwatch(metrics_dict, namespace="VPS/Health"):
    cw = boto3.client("cloudwatch", region_name="ap-south-1")
    metric_data = []
    for name, value in metrics_dict.items():
        metric_data.append({
            "MetricName": name,
            "Value": float(value),
            "Unit": "Percent" if "pct" in name.lower() else "Count"
        })
    if metric_data:
        cw.put_metric_data(Namespace=namespace, MetricData=metric_data)
        print(f"Pushed {len(metric_data)} metrics to CloudWatch")
```

Run this as a cron job on the EC2 every 5 minutes:
```bash
crontab -e
# Add:
*/5 * * * * /usr/bin/python3 /opt/scripts/vps-health-reporter.py --push-to-cloudwatch
```

In the CloudWatch console, verify the custom metrics appear under the `VPS/Health` namespace.

---

### Week 10 - Phase 2 Consolidation

**What this week unlocks:** You have AWS concretely on your CV. This week polishes everything, prepares interview answers for each project, and pushes the application cadence to 10+/week.

**Monday–Wednesday: Documentation and GitHub polish**

`vps-multiservice` README must now include:
- Architecture diagram: internet → Route53 (or direct IP) → Nginx (SSL) → Docker network → Flask/Redis/Postgres
- CI/CD flow: push to GitHub → Actions → SSH deploy to EC2 → smoke test
- AWS resources used: EC2, IAM role, S3 (backups), CloudWatch (logs and metrics)
- How to reproduce: step-by-step from zero to running

`ops-scripts` README: add a section on S3 backup integration and CloudWatch metric publishing.

**Thursday–Friday: Interview preparation (talking about your projects)**

For each project, prepare a 3-minute verbal answer to "tell me about this project":
- FactorSphere: "FactorSphere is a live academic journal ranking platform I built that serves real users. The architecture is fully edge-native: it runs on Cloudflare Workers rather than a traditional server, which means requests are processed at the edge closest to the user. The backend is a set of serverless microservices analogous to AWS Lambda: one for search, one for ranking, one for data processing. I integrated Pinecone as a managed vector database for semantic search and an LLM for query understanding. CI/CD is automated via GitHub Actions and Wrangler. I've documented the full pipeline in the repo."
- vps-multiservice: "This is a multi-service API stack I deployed on a VPS and on EC2. It runs Flask, Redis, and Postgres in Docker containers orchestrated by Docker Compose. I wrote multi-stage Dockerfiles to minimize image size, configured health checks so Compose knows when each service is ready before starting dependents, and set up Nginx as a reverse proxy with SSL termination. The deployment is automated via GitHub Actions: a push to main triggers an SSH deploy to the server with a smoke test. Logs ship to CloudWatch and backups go to S3 via an IAM role."

**Saturday–Sunday: Application push**

By Week 10, increase to 10–15 applications/week. You now have AWS on your CV. Apply to roles that were out of reach at Week 4:
- "AWS Cloud Engineer (Junior/Fresher)"
- "Cloud Infrastructure Engineer"
- GCC roles that specify AWS in the JD

---

### Weeks 11–12: Buffer, Deepening, and Interview Prep

**Week 11:** Take the weakest area (likely whichever of Linux/Docker/AWS you feel least confident explaining verbally) and go deeper. Run through interview scenarios: set up a broken Docker network and diagnose it, break an Nginx config and read the error, terminate an EC2 instance and relaunch from scratch without notes.

**Week 12:** Final project polish, README quality pass on all repos, continue applying. If interviews are happening, focus prep here on the specific company's stack.

---

## PHASE 3 (Weeks 13–24): Become Genuinely Competent

---

### Weeks 13–15: Terraform

**What this unlocks:** Infrastructure as Code is expected at mid-level and increasingly mentioned in junior JDs. More importantly, after doing Weeks 7–9 by hand in the console and CLI, Terraform will click immediately: you already understand the resources, and now you're just defining them declaratively.

**Week 13:** Terraform core concepts: providers, resources, state, plan/apply/destroy. Write Terraform that creates the Week 7/8 infrastructure: EC2, security groups, IAM role, S3 bucket. Run `terraform plan` to see what it would create. Run `terraform apply`. Verify it matches what you built manually.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "your-terraform-state"
    key    = "vps-multiservice/terraform.tfstate"
    region = "ap-south-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0287a05f0ef0e9d9a"  # Ubuntu 22.04, ap-south-1
  instance_type = "t2.micro"
  key_name      = aws_key_pair.deploy.key_name
  subnet_id     = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name
  user_data = file("bootstrap.sh")

  tags = {
    Name    = "vps-multiservice"
    Project = "portfolio"
  }
}
```
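Reading a long `terraform plan` by eye is error-prone, so one option is to export the plan as JSON and summarise it. The sketch below assumes the plan was saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan > plan.json`; the function name `summarise_plan` and the file names are illustrative. It counts planned actions from the `resource_changes` list in that JSON.

```python
"""Summarise a Terraform plan exported with `terraform show -json`."""
import json
from collections import Counter


def summarise_plan(plan_json):
    """Return (counts per action, resource addresses per action).

    `plan_json` is the JSON text produced by `terraform show -json <planfile>`.
    """
    plan = json.loads(plan_json)
    counts = Counter()
    addresses = {}
    for change in plan.get("resource_changes", []):
        # A replacement is reported as two actions, e.g. ["delete", "create"].
        action = "+".join(change["change"]["actions"])
        if action == "no-op":
            continue  # unchanged resources are not interesting here
        counts[action] += 1
        addresses.setdefault(action, []).append(change["address"])
    return dict(counts), addresses
```

For the Week 13 configuration on a clean account, every entry should be a `create`; a `delete+create` on a later run is the signal that a change would replace a resource rather than update it in place.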

**Week 14:** Terraform state, modules, variables, outputs. Write a module for the EC2 + security group pair so it's reusable. Store state remotely in S3 using the S3 backend, with a DynamoDB table for state locking (DynamoDB is free at this scale).

**Week 15:** Terraform for the full Week 7–9 stack: EC2, VPC, subnets, route tables, internet gateway, IAM roles, S3 buckets, CloudWatch log group. The entire infrastructure is now in the `infrastructure-iac` repo. `terraform apply` from scratch should produce a working environment.

**Repository:** `infrastructure-iac`, containing Terraform for all AWS infrastructure, a README explaining what it creates and why, `.terraform.lock.hcl` committed, the state backend configured, and variables documented in `variables.tf`.

---

### Weeks 16–17: Prometheus + Grafana

**What this unlocks:** A monitoring stack demonstrates operational maturity beyond deployment. "Have you set up monitoring?" is a common interview question; being able to say yes with a GitHub repo is a strong differentiator at junior level.

**Week 16:** Deploy Prometheus and the Node Exporter on the Hetzner VPS (not EC2, to keep costs at zero here). Prometheus scrapes Node Exporter for system metrics (CPU, memory, disk, network). Also scrape the Flask API's `/metrics` endpoint (add `prometheus-flask-exporter` to the app). Prometheus config:

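```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      # Node Exporter uses host networking, so reach it through the host gateway.
      # 'localhost' inside the Prometheus container is the container itself.
      - targets: ['host.docker.internal:9100']

  - job_name: flask-api
    static_configs:
      # Assumption: the Flask service is named flask-api in docker-compose.yml.
      - targets: ['flask-api:5000']
```

Run everything in Docker Compose: add Prometheus and Node Exporter under `services:` in the existing `docker-compose.yml` in `vps-multiservice`:
```yaml
prometheus:
  image: prom/prometheus:latest
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus_data:/prometheus  # declare prometheus_data under top-level volumes:
  ports:
    - "9090:9090"
  extra_hosts:
    # Lets the container resolve host.docker.internal on Linux (Docker 20.10+).
    - "host.docker.internal:host-gateway"

node_exporter:
  image: prom/node-exporter:latest
  network_mode: host
  pid: "host"
  command:
    # Point the exporter at the host paths mounted below.
    - --path.procfs=/host/proc
    - --path.sysfs=/host/sys
    - --path.rootfs=/rootfs
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```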
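Once the stack is up, it is worth confirming that both scrape jobs are actually healthy; a common failure is a `localhost` target that resolves inside the Prometheus container rather than on the host. The sketch below reads the Prometheus HTTP API endpoint `/api/v1/targets` and reports every target whose health is not `up`. The function names and the default base URL are illustrative assumptions.

```python
"""Report which Prometheus scrape targets are down."""
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every target not 'up'.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets payload from a running Prometheus instance."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Running `unhealthy_targets(fetch_targets())` on the VPS should return an empty list when both the `node` and `flask-api` jobs are being scraped; the `last_error` field usually says directly whether the problem is DNS, a refused connection, or a non-200 response.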

**Week 17:** Deploy Grafana, connect it to Prometheus as a data source, then build dashboards and alerts:
- System overview: CPU, memory, disk, network I/O
- Flask API: request rate, error rate, latency (p50, p95, p99)
- Alerts: define alert rules in Prometheus and route them through Alertmanager to send email when disk usage exceeds 80% or the API error rate exceeds 5%

Export your dashboards as JSON and commit them to the repo. This means the monitoring stack is reproducible.

---

### Weeks 18–22: Kubernetes

Five weeks because the surface area is large and the concepts require time to internalize.

**Week 18:** Kubernetes architecture: what a Node is, what a Pod is, what a Deployment is, how the control plane works (API server, scheduler, controller manager, etcd). Install `k3s` on a local KVM VM (not on the VPS, because k3s with multiple services will use too much RAM on a CX22):

```bash
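# The kubeconfig is root-only by default; this flag makes it readable without sudo
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644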
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes
kubectl get pods -A
```

Write Kubernetes manifests for the Flask API:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
      - name: flask-api
        image: myapi:latest
        ports:
        - containerPort: 5000
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: database-url
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 30
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api
spec:
  selector:
    app: flask-api
  ports:
  - port: 80
    targetPort: 5000
```
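The most common mistake in manifests like these is a label mismatch: the Service selector, the Deployment selector, and the Pod template labels must agree, or the Service ends up with no endpoints. The sketch below checks that wiring on plain dictionaries, such as those produced by loading the YAML with a parser like PyYAML (an assumption; any YAML loader works). The function names are illustrative.

```python
"""Check that a Deployment and Service select the Pods the template creates."""


def selects(selector, labels):
    """True when `selector` is non-empty and every pair is present in `labels`."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def check_wiring(deployment, service):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    template = deployment["spec"]["template"]
    pod_labels = template["metadata"].get("labels", {})

    if not selects(deployment["spec"]["selector"].get("matchLabels", {}), pod_labels):
        problems.append("Deployment selector does not match Pod template labels")
    if not selects(service["spec"].get("selector", {}), pod_labels):
        problems.append("Service selector does not match Pod template labels")

    # containerPort is informational in Kubernetes, so treat this as a warning:
    # it catches typos, not every possible misconfiguration.
    container_ports = {
        port["containerPort"]
        for container in template["spec"]["containers"]
        for port in container.get("ports", [])
    }
    for port in service["spec"].get("ports", []):
        target = port.get("targetPort", port["port"])
        if target not in container_ports:
            problems.append(f"Service targetPort {target} is not declared by any container")
    return problems
```

On a live cluster, `kubectl get endpoints flask-api` answers the same question: an empty endpoints list means the Service selector matches no running Pods.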

**Weeks 19–20:** ConfigMaps, Secrets, Ingress (using k3s's built-in Traefik ingress controller), PersistentVolumes for Postgres. Deploy the full multi-service stack to Kubernetes.

**Weeks 21–22:** Rolling updates, rollbacks (`kubectl rollout undo deployment/flask-api`), Horizontal Pod Autoscaler, namespace isolation, RBAC basics. Write a GitHub Actions workflow that builds a Docker image, pushes to Docker Hub or GitHub Container Registry, and updates the k3s deployment via `kubectl set image`.
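The update-then-verify flow can be sketched as a small wrapper around `kubectl`. This is an illustration under assumptions: the function names, the image reference in the usage note, and the 120-second timeout are hypothetical, and the CI runner is assumed to have `kubectl` configured for the cluster. It sets the new image, waits for the rollout, and runs `kubectl rollout undo` only if the rollout itself fails.

```python
"""Update a Deployment image and roll back if the rollout does not complete."""
import subprocess


def rollout_commands(deployment, container, image, timeout="120s"):
    """Return the argv lists for setting the image and waiting on the rollout."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "set", "image", target, f"{container}={image}"],
        ["kubectl", "rollout", "status", target, f"--timeout={timeout}"],
    ]


def deploy(deployment, container, image, run=subprocess.run):
    """Return True on a successful rollout, False otherwise.

    `run` is injectable so the flow can be tested without a cluster.
    """
    set_image, wait = rollout_commands(deployment, container, image)
    if run(set_image).returncode != 0:
        # Nothing was changed, so there is no new revision to undo.
        return False
    if run(wait).returncode != 0:
        run(["kubectl", "rollout", "undo", f"deployment/{deployment}"])
        return False
    return True
```

A workflow step could call `deploy("flask-api", "flask-api", "ghcr.io/<owner>/myapi:<commit-sha>")` and fail the job when it returns `False`. Tagging images with the commit SHA rather than `latest` matters here, because `kubectl set image` only triggers a rollout when the image reference changes.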

**Repository:** `k8s-fundamentals`, with all manifests in `manifests/`, a Kustomize or Helm intro in Week 22, and a README explaining architecture decisions.

---

### Weeks 23–24: Consolidation

Audit all repos: every README answers what it runs, how to run it, what decisions were made, and what you'd do differently with more time. This last part, "what I'd do differently", is an advanced interview signal. Examples: "I'd add a proper secrets management solution (Vault or AWS Secrets Manager) instead of .env files", "I'd move Terraform state to a proper remote backend with team locking", "I'd add structured logging and ship logs to an aggregator such as Loki, since Prometheus scrapes metrics rather than ingesting logs."

Write up a "portfolio narrative", a one-page document (not on GitHub, for your own use) that ties everything together: FactorSphere shows edge-native architecture and LLM integration; vps-multiservice shows traditional server ops; the Terraform repo shows IaC; the monitoring stack shows observability; the Kubernetes repo shows container orchestration. This is your answer to "walk me through your projects" in a senior technical round.

---

## END SECTION

---

### 1. GitHub Repositories by Milestone

**Week 4 Repositories**

`ops-scripts`: bash health check, backup, and log analyzer scripts; Python endpoint monitor, deploy check, Cloudflare deploy monitor, config validator, and VPS health reporter. Why a recruiter cares: it demonstrates ability to automate real operational tasks, which is the most common take-home test format. Scripts that run on real infrastructure (not mock data) are distinguishable immediately.

`vps-multiservice`: Docker Compose stack (Flask + Redis + Postgres), multi-stage Dockerfile, `.env.example`, Makefile with operational targets, systemd service file, and README documenting architecture decisions. Why a recruiter cares: it's a real multi-service production-style stack running on actual infrastructure, not a tutorial clone. The health check endpoint that checks dependency connectivity is a concrete signal of operational thinking.

`factorsphere` (existing, updated): `docs/ARCHITECTURE.md` and `docs/CICD.md` added. Why a recruiter cares: a live product with real users transforms this from a class project into a production deployment story. The architecture documentation demonstrates that you understand what you built, not just that you ran commands.

**Week 8 Repositories**

`vps-multiservice` (updated): now includes a GitHub Actions CI/CD workflow deploying to both VPS and EC2, Nginx reverse proxy configuration, CloudWatch log shipping, S3 backup integration, and an updated README with the full AWS architecture. A badge shows the live CI/CD status. Why a recruiter cares: this is a complete junior DevOps portfolio project covering Docker, CI/CD, AWS, monitoring, and backups, which is the canonical skill set for a junior role.

`ops-scripts` (updated): S3 backup integration in `backup.sh`, CloudWatch metric publishing in `vps-health-reporter.py`, updated README. Why a recruiter cares: scripts that interact with real cloud services (boto3, AWS CLI) demonstrate practical cloud knowledge beyond "I know what S3 is."

**Week 24 Repositories**

`infrastructure-iac`: Terraform modules for the full AWS stack (EC2, VPC, subnets, security groups, IAM, S3). `terraform apply` from a clean account produces the entire `vps-multiservice` environment. Why a recruiter cares: IaC in any junior portfolio is unusual and signals maturity; Terraform specifically is the most common IaC tool in NCR DevOps JDs.

`monitoring-stack`: Prometheus, Grafana, and Alertmanager in Docker Compose, dashboard JSON files committed, Prometheus recording rules, Alertmanager config with email notifications. Why a recruiter cares: monitoring is the answer to "how do you know your service is healthy?", and most junior candidates can't demonstrate this.

`k8s-fundamentals`: Kubernetes manifests for the multi-service stack running on k3s, GitHub Actions pipeline that builds and deploys to the cluster, RBAC configuration, HPA configuration. Why a recruiter cares: Kubernetes is increasingly mentioned even in junior NCR DevOps JDs; a working cluster with manifests is proof of hands-on experience, not certification prep.

---

### 2. Job Titles and Application Timing

**Apply now (Week 4)**

Titles: Junior DevOps Engineer, DevOps Engineer (0-2 years), Cloud Support Engineer, Infrastructure Engineer, Systems Engineer, Associate DevOps Engineer

Platforms: LinkedIn (primary; set alerts), Naukri.com (mandatory: service companies and GCCs post here exclusively), Instahyre, Wellfound

Company targets in NCR: Nagarro (Gurgaon), GlobalLogic (Noida), Publicis Sapient (Gurgaon), NIIT Technologies, Mphasis, smaller funded startups from Antler portfolio and other VC-backed companies. For service companies: HCL Technologies and Wipro have DevOps JDs that are genuine infrastructure work (not all of them, so filter by the actual JD content, not just the title).

Direct career pages worth bookmarking: nagarro.com/careers, globallogic.com/careers, publicissapient.com/careers, mphasis.com/careers. These pages often have roles that don't appear on aggregators for 1–2 weeks after posting.

**Apply at Week 8**

You now have AWS concretely on your CV. Expand to: AWS Cloud Engineer (Junior), Cloud Infrastructure Engineer, Cloud Operations Engineer, DevOps Engineer roles that specify AWS in the JD.

New targets: GCCs that explicitly require AWS, such as Optum (Noida), Genpact Technology (Gurgaon), EXL Service, WNS, and Concentrix Technology. Apply via their career pages and Naukri simultaneously. Your profile is now meaningfully stronger: end-to-end AWS stack (EC2, IAM, S3, VPC, CloudWatch) with CI/CD and a live project to discuss.

**Do not apply until Week 24**

Platform Engineer: typically requires Kubernetes + Terraform + 2+ years. Platform teams at larger companies in NCR have a higher bar.

Senior DevOps Engineer: even at companies that post "1-3 years experience required", the actual expectation is 2+ years of real DevOps work. Applying earlier wastes your time.

DevSecOps Engineer: security tooling (Vault, Snyk, SAST/DAST pipelines) is a domain requiring deliberate study not on this roadmap.

Cloud Architect: requires 5+ years and cert-level AWS knowledge.

---

### 3. What Gets Tested in NCR DevOps Interviews

**Phone screen / HR call**

You will be asked: your experience summary (have a 90-second version), which tools you've used (answer specifically: not "I know Docker", but "I've written multi-stage Dockerfiles and run Compose stacks on a VPS"), availability and notice period (fresh grad = immediate joining), salary expectation (state your target, not your floor; see Section 5), why DevOps specifically.

They are screening for: articulate communication, genuine experience (not just listed on CV), salary fit, and basic logical thinking. They are not testing technical knowledge here.

**Technical round**

Linux questions that actually appear:
- "What does the first column of `ps aux` output mean?" (process owner)
- "How do you find which process is using port 8080?" (`ss -tlnp | grep 8080` or `lsof -i :8080`)
- "What's the difference between `kill -9` and `kill -15`?" (SIGKILL vs SIGTERM: immediate termination that cannot be caught vs a graceful shutdown request)
- "A service is failing to start. Walk me through your diagnosis." (`systemctl status`, `journalctl -u servicename -n 50`, check the binary exists and permissions are correct, check the port isn't already in use)
- "What is load average?" (average number of processes running or waiting to run over the last 1, 5, and 15 minutes, read relative to the number of CPU cores; on Linux it also counts processes in uninterruptible I/O wait)
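The "relative to CPU cores" part of the load average answer is easy to show with numbers. The sketch below is a plain illustration with hypothetical function names: it divides each load figure by the core count, so a value above 1.0 means more runnable work than cores, and it compares the 1-minute and 15-minute figures to say whether load is rising.

```python
"""Interpret load average relative to the number of CPU cores."""
import os


def load_per_core(load_avg, cores):
    """Divide the (1m, 5m, 15m) load averages by the core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(round(value / cores, 2) for value in load_avg)


def describe(load_avg, cores):
    """Return a one-line reading of the current load."""
    one, _five, fifteen = load_per_core(load_avg, cores)
    state = "saturated" if one > 1.0 else "has headroom"
    trend = "rising" if one > fifteen else "falling or steady"
    return f"1m load per core {one:.2f} ({state}, {trend})"


def current_summary():
    """Read the live values on a Unix host."""
    return describe(os.getloadavg(), os.cpu_count() or 1)
```

For example, a load average of `4.0, 2.0, 1.0` on a 2-core machine gives `describe((4.0, 2.0, 1.0), 2)`, which returns `1m load per core 2.00 (saturated, rising)`: twice as much runnable work as the cores can serve, and getting worse.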

Docker questions that actually appear:
- "What's the difference between CMD and ENTRYPOINT in a Dockerfile?" (CMD is replaced by any arguments passed to `docker run`; ENTRYPOINT stays in place unless overridden with `--entrypoint`. When both are set, CMD provides default arguments to ENTRYPOINT)
- "How do containers in a Docker Compose network resolve each other?" (by service name: Docker provides built-in DNS for user-defined networks)
- "What's a multi-stage build and why use it?" (multiple FROM instructions, only the final stage goes into the image, which keeps build tools out of the production image and reduces size and attack surface)
- "How do you make a Docker container restart automatically?" (`--restart unless-stopped` or `restart: unless-stopped` in Compose)

Networking questions that actually appear:
- "Walk me through what happens when a user visits a URL" (DNS lookup → TCP connection → TLS handshake → HTTP request → response)
- "What's the difference between a reverse proxy and a load balancer?" (a reverse proxy hides the backend; a load balancer distributes across multiple backends; Nginx can do both)
- "What does a 502 error mean?" (bad gateway: the proxy received an invalid response from the upstream server)
- "How does SSH authentication work?" (the client offers a public key; if the server finds it in `authorized_keys`, the client proves it holds the matching private key by signing data tied to the session, and the server verifies that signature with the stored public key)

Scripting:
- You may be asked to write a Bash or Python script live. Common formats: "write a script that checks if these services are running and restarts any that aren't", "write a function that parses this log file and counts occurrences of each status code"
- The evaluator is checking: do you use proper error handling (`set -euo pipefail` in bash, `try/except` in Python), do you use functions, do you write readable code
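The log-parsing question can be answered in a few lines. The sketch below is one possible answer, assuming the log is in the common or combined access log format used by Nginx by default, where the status code follows the quoted request line; the function names and regex are illustrative. It shows the things the evaluator looks for: small functions, explicit error handling, and skipping malformed lines instead of crashing.

```python
"""Count HTTP status codes in an access log."""
import re
from collections import Counter

# The status code is the 3-digit field right after the quoted request line,
# e.g. ... "GET /health HTTP/1.1" 200 612
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_codes(lines):
    """Return a Counter mapping status code strings to occurrences."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # silently skip lines that do not look like access log entries
            counts[match.group(1)] += 1
    return counts


def count_in_file(path):
    """Read `path` and count status codes, exiting cleanly if it is unreadable."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return count_status_codes(handle)
    except OSError as exc:
        raise SystemExit(f"cannot read {path}: {exc}")
```

In an interview, printing `count_in_file("/var/log/nginx/access.log").most_common()` gives the codes sorted by frequency, and it is worth saying out loud that the file is streamed line by line rather than loaded into memory.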

**Practical / take-home task**

Common formats (2–4 hours):
- "Write a Dockerfile for this Python app, a Docker Compose file that adds Redis, and a health check endpoint": you'll have a GitHub repo to fork
- "Write a GitHub Actions workflow that runs tests and deploys to a remote server on push to main": they provide fake SSH credentials or ask you to describe what the secrets would contain
- "Write a Python script that monitors these endpoints and emails you if any return non-200": this tests `requests`, SMTP, and argparse
- "Debug this broken docker-compose.yml": they give you a file with 3-5 intentional errors

What separates good submissions: health checks are included, `.env.example` is present, README explains what it does and how to run it, commit history shows incremental work (not one giant commit), you handle error cases explicitly.
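As an illustration of the endpoint-monitor task, here is a minimal sketch. It swaps the third-party `requests` library for the standard library's `urllib` so it runs without installing anything; the SMTP host, sender address, and function names are assumptions for the example. Error cases are handled explicitly, which is the point the graders look for.

```python
"""Check endpoints and email an alert when any does not return 200."""
import argparse
import smtplib
import urllib.error
import urllib.request
from email.message import EmailMessage


def check(url, timeout=5, opener=urllib.request.urlopen):
    """Return (url, status) where status is an int or an error string."""
    try:
        with opener(url, timeout=timeout) as resp:
            return url, resp.status
    except urllib.error.HTTPError as exc:
        return url, exc.code  # urllib raises for 4xx/5xx responses
    except (urllib.error.URLError, OSError) as exc:
        return url, f"error: {exc}"  # DNS failure, refused connection, timeout


def failures(results):
    """Keep only the results whose status is not 200."""
    return [(url, status) for url, status in results if status != 200]


def build_alert(failed, sender, recipient):
    """Build the alert email listing every failing endpoint."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(failed)} endpoint(s) failing"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{url}: {status}" for url, status in failed))
    return msg


def main(argv=None):
    parser = argparse.ArgumentParser(description="Alert on non-200 endpoints")
    parser.add_argument("urls", nargs="+", help="endpoints to check")
    parser.add_argument("--to", required=True, help="alert recipient")
    parser.add_argument("--sender", default="monitor@example.com")
    parser.add_argument("--smtp-host", default="localhost")
    args = parser.parse_args(argv)

    failed = failures(check(url) for url in args.urls)
    if failed:
        with smtplib.SMTP(args.smtp_host) as smtp:
            smtp.send_message(build_alert(failed, args.sender, args.to))
    return 1 if failed else 0
```

Wiring it up is one more line, `raise SystemExit(main())`, at the bottom of the script. Returning a non-zero exit code on failure means the same script also works as a cron job or a CI smoke test, not only as an emailer.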

---

### 4. Common Mistakes That Prevent Freshers From Getting DevOps Jobs in India

**Listing technologies you can't defend.** If "Kubernetes" is on your CV because you ran `kubectl get pods` once in a tutorial, an interviewer asking "how does a rolling update work?" will end the conversation. Every technology on your CV must have a project behind it and a clear verbal answer to "tell me how you used this."

**GitHub repos without READMEs.** A recruiter spending 30 seconds on a repo with no README closes the tab. A repo called `docker-practice` with three files and no explanation tells a hiring manager nothing about your ability to operate infrastructure. Every repo needs to explain what it runs, why it exists, and what decisions were made.

**Applying to renamed helpdesk roles.** Many "DevOps Engineer" JDs at service companies are L1 support with a rebranded title. Read the JD carefully. Red flags include: "ITIL certification preferred", "ticketing system experience required", "incident management", no mention of Docker/AWS/Linux in the technical requirements. These roles do not build the skills you need and often pay below the floor you've set.

**Not knowing your own project in depth.** "I deployed a multi-service Docker Compose stack" is not an answer. "I deployed a Flask API with Redis and Postgres in Docker Compose, using multi-stage builds to reduce image size from 1.1GB to 180MB, with health checks so Postgres is confirmed ready before the API starts, running behind Nginx with SSL termination, and automated via GitHub Actions" is an answer. Practice this verbally, not just in your head.

**Treating salary floor as the opening number.** If you say "I'm looking for around ₹35,000" at a startup that would pay ₹55,000 for your profile, you've lost ₹20,000/month permanently. Know the market rates (see Section 5), and start at your target ceiling for each company tier, not your floor.

**One-commit GitHub histories.** A repo where all the work appears in a single commit titled "add project files" signals that you copied files over rather than built incrementally. Commit as you work. The commit history is evidence of your process.

**Not applying early enough.** The job search takes time independent of your preparation level. Candidates who start applying at Week 4 get their first offers at Week 8–12. Candidates who wait until they feel "ready" start at Week 8 and get their first offers at Week 14–18. The feedback from real rejections improves your interview performance faster than another week of studying.

**Inconsistent positioning.** Applying to a DevOps role and then mentioning in the interview that you also do React or full-stack work signals that you don't actually want a DevOps role. Decide on the role type and be consistent in every touch point: resume, LinkedIn, outreach, interview answers.

**Vague answers to "tell me about your experience."** "I worked at an Antler-backed company on various projects" is vague. "I was a software developer intern at a venture studio backed by Antler, where I shipped production SaaS MVPs across multiple projects. I owned Docker configuration and CI/CD pipelines across three projects, integrated with external APIs, and worked with distributed teams across time zones" is not vague.

**Sending generic outreach.** A LinkedIn message that could have been sent to anyone ("I am a passionate DevOps professional seeking opportunities") gets ignored. One that references the company's actual stack or a specific JD they posted gets responses. Take 3 minutes per message to make it specific.

---

### 5. Realistic Salary Ranges

**NCR service companies (TCS, HCL, Wipro, Infosys, Tech Mahindra)**

Range: ₹3.5–5 LPA (CTC). Take-home is ~70–75% of CTC after PF, tax, insurance deductions. At ₹4.5 LPA CTC, take-home is approximately ₹28,000–31,000/month.

Your profile puts you toward the upper end of the fresher band, but service companies have fixed fresher slabs that don't move much for individual profile quality. The internship experience may place you in the "experienced fresher" band at some companies (₹4.5–5.5 LPA) versus the "campus fresher" band (₹3.5–4 LPA).

Honest assessment: these roles are the right floor for negotiation, not the target. The actual work at L1/L2 entry in service companies is often not genuine infrastructure. The learning environment is slower. Treat these as the fallback, not the goal.

Skills that move you up within this band: AWS certification (not worth getting for this tier, though), ITIL knowledge (not worth learning for this tier either). Don't optimize for this band.

**NCR funded startups (Series A–C, Antler portfolio)**

Range: ₹5–8 LPA for a genuine junior DevOps role. At ₹6 LPA, take-home is approximately ₹42,000–45,000/month depending on structure.

Your Antler studio connection is a direct advantage here. Antler portfolio companies know what their studio interns produce, so the signal is stronger than a cold application. FactorSphere as a live production product with real users is unusual for a fresher; most candidates at this stage have tutorial clones.

Your realistic target at a well-funded startup: ₹6–7 LPA at Week 4, negotiable to ₹7–8 LPA after Week 8 with AWS on the CV.

Skills that move you up this band: AWS working knowledge (Week 8), ability to own deployment pipelines end-to-end without supervision, Docker and Compose at production level (Week 4), Python scripting that actually runs in their infrastructure.

**Product companies and GCCs (Nagarro, GlobalLogic, Publicis Sapient, Optum, Genpact Tech, EXL)**

Range: ₹6–10 LPA for cloud/DevOps roles. At ₹8 LPA, take-home is approximately ₹55,000–60,000/month.

GCCs pay more than domestic companies for equivalent work because they're paying against a global compensation benchmark. The tradeoff is a higher technical bar at screening and more structured interview processes.

Your profile is competitive here after Week 8 (AWS added). Before Week 8, your serverless and Cloudflare experience is harder to map to what GCC interviewers are looking for; after Week 8, the AWS + Docker + CI/CD story is a clean match.

Realistic target at a GCC: ₹7–9 LPA at Week 8, with AWS and a demonstrated CI/CD project.

Skills that move you up this band: specific AWS service depth (beyond EC2/S3: RDS, ECS, ECR, CodePipeline), monitoring/observability (Prometheus or CloudWatch), IaC awareness (Terraform in Phase 3).

**Your Week 4 realistic ceiling (before AWS):** ₹6–7 LPA at a funded startup or mid-size product company where your FactorSphere + Docker + CI/CD story lands well.

**Your Week 8 realistic ceiling (after AWS):** ₹8–9 LPA at a GCC or strong product company.

**Your floor (non-negotiable):** ₹35,000/month = ~₹4.2 LPA. Achievable at service companies. Don't accept less: it's below market even at service companies for a profile with live production experience.

**Negotiation:** When a recruiter asks for your expectation, say the target number for that company tier, not your floor. "I'm looking for ₹6–7 LPA, based on my production experience and the market for junior DevOps roles in NCR." If they push back, ask what the band is before accepting or declining.

---

### 6. Honest Assessment of the Week 4 Target

Week 4 interview-readiness for junior DevOps roles at startups and mid-size product companies in NCR is realistic. This is not inflated encouragement; it's based on what your profile actually produces by Week 4:

You arrive at Week 4 with: Python competency, Linux daily driver experience, Docker from internship, Git fluency, a live production product with real users, and production CI/CD experience. These are not nothing. Most fresher DevOps candidates have none of the internship context and only classroom exposure.

By Week 4 you add: ops-level Linux (systemd, SSH hardening, process management, log analysis), practical networking (DNS, HTTP, SSH tunneling, firewalls), DevOps Python scripting (subprocess, requests, YAML/JSON, proper CLIs), and production Docker + Compose (multi-stage builds, health checks, named volumes, systemd-managed stacks).

This is enough to pass a phone screen and a standard junior technical round at a startup or mid-tier product company. It is not enough to pass a rigorous GCC technical screen that probes AWS depth; readiness for that comes after Week 8.

**Most likely blocker:** The gap between knowing how something works and being able to explain it under mild interview pressure. Docker networking questions specifically ("a container in service A can't reach service B, what do you check?") require not just knowing the answer but being comfortable walking through it out loud. This gap closes with deliberate verbal practice, not with more studying. Spend 30 minutes every day of Week 4 talking through your projects as if you're in an interview. Record yourself if you can: the first playback will tell you exactly what to fix.

**Second most likely blocker:** Sparse GitHub repos that don't match your verbal claims. If you say "I built a multi-service Docker Compose stack on a VPS" and the recruiter opens the repo to find three files and no README, you've undermined your own story. Prioritize clean, documented repos with meaningful commit history over adding more features.

**Week 6–8 for first offer:** Achievable if you apply at 10+ per week starting Week 4 and follow up on applications actively. Realistic for the right startup or product company.

**Week 8–10 for first offer:** The more likely outcome for most candidates, accounting for interview scheduling delays (common in India), HR processes, and the normal distribution of fit between your profile and open roles. This is not a failure case; it's the median outcome for a candidate with your profile executing this plan.

**Week 6–8 for GCC offer:** Unlikely. GCC processes in NCR typically take 4–6 weeks from application to offer even when you pass every round. Apply to GCCs at Week 8 and expect the offer to come at Week 12–14.

The 4-week interview-readiness target is sound. The 6–8 week first offer target is optimistic but achievable. The 8–10 week target is realistic without being pessimistic. The answer to "should I start applying at Week 4 even if I don't feel ready?" is yes, unambiguously.