DevOps Career Roadmap: NCR, 2025-26
DSA: Answered Directly
Do NCR DevOps, Cloud Support, and Infrastructure roles screen for LeetCode-style DSA?
At funded startups targeting DevOps specifically: no. The technical screen is scripting (Python or Bash), Linux troubleshooting scenarios, and Docker/Compose tasks. No arrays, graphs, or dynamic programming.
At product companies and GCCs: mostly no, but with a real exception. Some GCCs (Optum, Genpact Tech, Publicis Sapient) route all applicants through a standardized first-round online assessment that includes basic coding: not algorithmic complexity, but "write a function that does X" type problems a second-year CS student should handle. If you apply to these through their career portal (not via referral or recruiter), you may hit this screen. The coding is at the level of "reverse a string without slicing" or "count words in a paragraph", not LeetCode mediums. If you've written Python for two years, you're fine. No prep needed.
At service companies (TCS, Wipro, HCL, Infosys): their entry tests include aptitude + basic coding. The coding section is trivial CS101 material, not competitive programming. If this is the blocker, 2 hours of practice on basic Python problems clears it.
Conclusion: No LeetCode prep is needed. No LeetCode appears in this roadmap under any framing. The only coding you'll write as interview prep is ops-relevant Python scripting, which is on the roadmap as a skill anyway.
Technology Rationale Table
| Technology | Immediate Employability Impact | Learnable on the Job? | Why It's Here |
|---|---|---|---|
| Linux (ops-specific) | High | Partially | Every VPS, EC2, and container runs Linux; ops-level commands are tested directly in technical rounds |
| Git (infra conventions) | Medium | Yes | Ops repos, GitOps patterns, and READMEs are how hiring managers evaluate your actual work product |
| Networking | High | Partially | DNS, SSH, firewall, and HTTP debugging appear explicitly in junior DevOps technical screens in NCR |
| Python scripting | High | Partially | Automation scripts are the most common take-home task format; real scripts separate you from CV-padders |
| Docker | High | No | In nearly every junior DevOps JD in NCR; multi-stage builds and Compose are now baseline expectations |
| Docker Compose | High | Partially | Multi-service Compose stacks are the standard deployment unit at companies without Kubernetes |
| Nginx | High | Partially | Reverse proxy knowledge is expected at junior level; SSL termination and upstreams come up in interviews |
| GitHub Actions | High | Yes | CI/CD is table stakes; VPS deploy pipelines are more transferable than Cloudflare-specific ones |
| AWS EC2/IAM/S3/VPC | High | Partially | Most NCR product companies and GCCs run on AWS; differentiates you from candidates who only know on-prem |
| CloudWatch | Medium | Yes | Logging and basic monitoring show operational awareness; AWS-native so low friction to learn |
| Terraform | Medium | Yes | Increasingly in mid-level JDs; rarely a hard gate at junior level in NCR, but Phase 3 depth pays off post-hire |
| Prometheus | Medium | Yes | Monitoring stack is a bonus at junior level; signals genuine ops thinking beyond just deployment |
| Grafana | Medium | Yes | Dashboard work demonstrates operational maturity; trivial to add once Prometheus is running |
| Kubernetes | Medium | No | Appearing in NCR JDs even at junior level now; large surface area justifies Phase 3 placement |
PHASE 1: Weeks 1-4 - Become Employable
Week 1 - Linux (Ops-Specific) + Git (Infra Conventions)
What this week unlocks: Your VPS becomes a real work environment, not a toy. You can answer Linux troubleshooting questions in interviews. FactorSphere gets professional documentation that turns a live project into an interview story.
Study hours: 42
Learning objectives:
- Ops-level networking stack: reading active connections, capturing traffic, diagnosing from the command line
- SSH hardening: key-only auth, non-root user, config file discipline
- systemd: writing, enabling, and managing service units from scratch
- Bash scripting for automation: health checks, backups, log analysis
- Process and resource management at ops level
- Log analysis with journalctl and standard log tooling
- How infrastructure repos differ from application repos; writing READMEs a hiring manager actually reads
- Professional documentation for FactorSphere CI/CD pipeline and architecture
Technologies: Linux (Ubuntu 22.04 on Hetzner VPS), Bash, systemd, UFW, SSH, journalctl, Git
Monday-Tuesday: VPS baseline and network stack
SSH into the Hetzner VPS. If you're logging in as root, fix that first.
adduser deploy
usermod -aG sudo deploy
SSH hardening. Edit /etc/ssh/sshd_config:
PasswordAuthentication no
PermitRootLogin no
AllowUsers deploy
Copy your public key to the new user:
ssh-copy-id -i ~/.ssh/id_ed25519.pub deploy@<vps-ip>
Generate a dedicated key if you don't have one: ssh-keygen -t ed25519 -C "vps-deploy-key"
Restart sshd: systemctl restart sshd. Verify you can log in as deploy before ending the root session.
UFW setup:
ufw default deny incoming
ufw default allow outgoing
ufw allow from <your-home-ip> to any port 22
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
ufw status verbose
Networking stack commands. Run these and understand every line of output:
ip addr show
ip route show
ss -tlnp # listening TCP sockets with process names
ss -tulnp # TCP + UDP
netstat -tlnp # older but still appears in interviews
Install tcpdump: apt install tcpdump. Run:
tcpdump -i eth0 port 22 # watch your own SSH session
tcpdump -i eth0 port 80 -A # see HTTP traffic in ASCII
You're not becoming a packet analysis expert. You're learning to answer "how would you debug a connectivity issue" with real commands.
Wednesday: systemd
Write a real systemd service. Create a minimal Python HTTP server first:
# /opt/healthapi/server.py
import http.server, json
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Every GET returns a small JSON health payload
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())
    def log_message(self, *args): pass  # silence per-request logging
http.server.HTTPServer(('', 8080), Handler).serve_forever()
Service unit /etc/systemd/system/healthapi.service:
[Unit]
Description=Health API
After=network.target
[Service]
Type=simple
User=deploy
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable healthapi
systemctl start healthapi
systemctl status healthapi
curl http://localhost:8080
journalctl -u healthapi -f # tail logs
journalctl -u healthapi --since "1 hour ago"
journalctl -p err -u healthapi # errors only
Stop the service, break it intentionally (wrong path), observe systemctl status output. This is what debugging looks like.
Thursday: Bash scripting for automation
Write three scripts. These go on GitHub. They are real deliverables, not exercises.
health-check.sh (VPS health report):
#!/bin/bash
set -euo pipefail
THRESHOLD_DISK=80
THRESHOLD_MEM=90
echo "=== VPS Health Check $(date) ==="
# Disk
DISK_USE=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
echo "Disk usage: ${DISK_USE}%"
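if [ "$DISK_USE" -gt "$THRESHOLD_DISK" ]; then echo "WARNING: disk above ${THRESHOLD_DISK}%" >&2; fi
# Memory
MEM_TOTAL=$(free -m | awk '/Mem:/ {print $2}')
MEM_USED=$(free -m | awk '/Mem:/ {print $3}')
MEM_PCT=$(( MEM_USED * 100 / MEM_TOTAL ))
echo "Memory usage: ${MEM_PCT}% (${MEM_USED}/${MEM_TOTAL} MB)"
if [ "$MEM_PCT" -gt "$THRESHOLD_MEM" ]; then echo "WARNING: memory above ${THRESHOLD_MEM}%" >&2; fi
# Services: edit list for your environment
for SVC in healthapi nginx docker; do
    # is-active prints the state itself and exits non-zero unless the unit is active;
    # "|| true" keeps set -e from aborting without appending a second word to STATUS
    STATUS=$(systemctl is-active "$SVC" 2>/dev/null || true)
    echo "Service $SVC: ${STATUS:-unknown}"
done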
# Open ports
echo "Listening ports:"
ss -tlnp | awk 'NR>1 {print $1, $4, $6}'
backup.sh (timestamped archive):
#!/bin/bash
set -euo pipefail
SRC="${1:?Usage: backup.sh <source-dir>}"
DEST="/var/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
ARCHIVE="${DEST}/backup_${TIMESTAMP}.tar.gz"
mkdir -p "$DEST"
tar czf "$ARCHIVE" "$SRC"
echo "Backup created: $ARCHIVE ($(du -sh "$ARCHIVE" | cut -f1))"
# Keep last 7 backups
ls -t "${DEST}"/backup_*.tar.gz | tail -n +8 | xargs -r rm
echo "Old backups pruned. Current count: $(ls "${DEST}"/backup_*.tar.gz | wc -l)"
log-analyzer.sh (parse journalctl for errors in the last N hours):
#!/bin/bash
HOURS="${1:-24}"
echo "=== Error summary: last ${HOURS} hours ==="
journalctl --since "${HOURS} hours ago" -p err --no-pager | \
awk '{print $5}' | sort | uniq -c | sort -rn | head -20
Make them executable (chmod +x), test them, commit them with a meaningful message: feat(scripts): add VPS health check, backup, and log analysis scripts.
Friday-Saturday: Process management, disk, LVM awareness, log deep dive
Process management:
ps aux --sort=-%cpu | head -10 # top CPU consumers
ps aux --sort=-%mem | head -10 # top memory consumers
kill -9 <pid> # SIGKILL
kill -15 <pid> # SIGTERM (graceful)
nice -n 10 <command> # lower priority
renice -n 5 -p <pid> # change running process priority
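top and htop: know what load average means. A load average of 1.0 on a single-core machine means the CPU is fully utilized; on a 4-core machine, 1.0 is roughly 25% of capacity. On Linux the figure also counts processes blocked in uninterruptible I/O, so a high load with idle CPUs usually points at storage rather than compute. This comes up in interviews.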
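The per-core arithmetic can be sketched in Python. This is a minimal illustration rather than one of the roadmap's deliverables; load_per_core is a hypothetical helper name, and os.getloadavg() exists only on Unix-like systems.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return load average divided by core count (1.0 means every core is busy)."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load_avg / cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5, and 15 minute averages (Unix only)
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    ratio = load_per_core(one_min, cores)
    print(f"1-min load {one_min:.2f} on {cores} cores = {ratio:.0%} of capacity")
```

A ratio that stays above 1.0 means work is queuing; compare the 1-minute and 15-minute values to see whether it is a spike or a trend.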
Disk:
df -h # filesystem usage
du -sh /var/* # directory sizes
lsblk # block devices
fdisk -l # partition table
LVM: you may not have LVM on the VPS, but understand the commands:
pvs # physical volumes
vgs # volume groups
lvs # logical volumes
Log analysis:
journalctl --since "2025-01-01" --until "2025-01-02"
journalctl -u nginx --no-pager | grep "502" | wc -l
grep -E "ERROR|WARN" /var/log/syslog | tail -50
zcat /var/log/syslog.2.gz | grep ERROR # compressed log files
Sunday: Git infra conventions + FactorSphere documentation
This is the most interview-impactful work of the week.
Create docs/ in the FactorSphere repo. Write two files:
ARCHITECTURE.md should cover: why Cloudflare Workers instead of a traditional server (latency, no cold starts at edge, cost at zero users), how requests flow (DNS -> Cloudflare edge -> Worker -> Pinecone/LLM -> response), data flow for the ranking pipeline (source aggregation -> processing Workers -> Pinecone indexing -> query Workers), why Pinecone over a hosted Postgres vector extension (managed, no infra to maintain), how the frontend on Cloudflare Pages is decoupled from the Worker API, what the tradeoffs are (no persistent state in Workers, Pinecone vendor lock-in, cold start behavior). Use a Mermaid diagram:
graph LR
User --> CF_Edge[Cloudflare Edge]
CF_Edge --> Worker_API[Workers API]
Worker_API --> Pinecone[(Pinecone Vector DB)]
Worker_API --> LLM[LLM Inference]
CF_Pages --> Worker_API
CICD.md should cover: what triggers a deploy (push to main), the GitHub Actions workflow steps (lint -> type check -> Wrangler deploy), what wrangler deploy actually does (bundles the Worker, uploads to Cloudflare's edge network), how Cloudflare Pages handles frontend deploys automatically (build hook on push), what happens on failure (GitHub Actions marks the workflow run as failed, Wrangler does not promote the broken version, and the previous version stays live), how to roll back (revert commit + push, or wrangler rollback to a previous deployment ID), what environment variables are injected at deploy time vs stored as Cloudflare secrets.
Write this so you can recite it verbally in a 3-minute interview answer. That's the test.
Infra repo conventions:
- READMEs in infrastructure repos answer: what this runs, how to run it, what environment variables it needs, what the architecture looks like, and what decisions were made and why
- Application repos explain features; infrastructure repos explain operations
- Commit messages in infra repos are imperative and specific: "fix(nginx): increase worker_connections to handle spike load", not "update config"
- Store .env.example with all variable names and no values. Never commit .env.
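The commit message convention can be checked mechanically. The sketch below is an illustration only: the function name is_specific_commit and the list of allowed types are assumptions, not a standard this roadmap prescribes, so adjust both to whatever your team uses.

```python
import re

# Shape: type(scope): subject, e.g. "fix(nginx): increase worker_connections ..."
# The allowed types are an assumption for illustration.
COMMIT_RE = re.compile(r"^(feat|fix|refactor|docs|chore|ci)\([a-z0-9-]+\): \S.*$")


def is_specific_commit(message: str) -> bool:
    """Return True if the first line follows the type(scope): subject shape."""
    first_line = message.splitlines()[0] if message else ""
    return bool(COMMIT_RE.match(first_line))


print(is_specific_commit("fix(nginx): increase worker_connections to handle spike load"))
print(is_specific_commit("update config"))
```

A check like this could run as a commit-msg hook or a CI step, but the habit matters more than the tooling.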
Deliverables, Week 1:
- Hetzner VPS: non-root deploy user, SSH key-only auth, UFW configured, healthapi systemd service running
- GitHub repo ops-scripts: health-check.sh, backup.sh, log-analyzer.sh with meaningful commits and a README
- FactorSphere repo: docs/ARCHITECTURE.md and docs/CICD.md, professional and interview-ready
Week 2 - Networking (Practical)
What this week unlocks: You can answer every networking question in a junior DevOps technical round. DNS debugging, HTTP troubleshooting, SSH advanced usage, and firewall diagnosis are the most common technical screen topics. This week makes you competent at all of them.
Study hours: 42
Learning objectives:
- DNS: resolution chain, record types, TTL, cache behavior, practical debugging tools
- HTTP: headers, status codes, TLS handshake, what curl reveals
- TCP/IP: three-way handshake, port states, socket inspection
- SSH: config file, tunneling, port forwarding, agent forwarding
- Firewalls: UFW rule management, iptables literacy, nftables awareness
- tcpdump for real diagnosis
Monday-Tuesday: DNS
Install: apt install dnsutils on the VPS if not present.
dig factorsphere.org # full answer section
dig +short factorsphere.org # just the IP
dig +trace factorsphere.org # full resolution chain from root
dig @8.8.8.8 factorsphere.org # force a specific resolver
dig @1.1.1.1 factorsphere.org # Cloudflare resolver
dig MX factorsphere.org # mail records
dig TXT factorsphere.org # SPF, DKIM, verification records
dig CNAME www.factorsphere.org # canonical name
dig -x <ip-address> # reverse lookup
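Read the AUTHORITY SECTION and ADDITIONAL SECTION in dig output. Understand what TTL means: if you change a DNS record, resolvers that already cached the old answer keep serving it until its TTL expires, so the worst-case switch-over time is roughly the old record's TTL. This is how to answer "how long does DNS propagation take?"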
Watch DNS queries in real time:
tcpdump -i any port 53 -n
# Open another terminal and run: dig google.com
# Watch the query and response packets appear
Understand /etc/resolv.conf (which nameserver your system queries) and /etc/hosts (local override, checked before DNS). Add an entry to /etc/hosts that maps a fake hostname to localhost, verify it resolves, then remove it.
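To make the /etc/hosts format concrete, here is a small parsing sketch. It is a simplification under stated assumptions: parse_hosts is a hypothetical helper, it treats the first entry for a name as the winner, and it ignores the resolver ordering that /etc/nsswitch.conf controls on a real system.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Map each hostname in hosts-file text to its IP; the first entry wins."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()  # format: IP followed by one or more names
        for name in names:
            mapping.setdefault(name, ip)
    return mapping


sample = """
127.0.0.1 localhost
# temporary test entry
127.0.0.1 fake.internal  # remove after testing
"""
print(parse_hosts(sample))
```

On the VPS, the same function can be fed open("/etc/hosts").read() to see which names are overridden locally.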
Configure a real subdomain if you have a domain: point vps.yourdomain.com to your Hetzner IP as an A record. Verify with dig. This demonstrates you've actually managed DNS, not just read about it.
Wednesday: HTTP in depth
curl -v https://factorsphere.org # verbose: see TLS handshake, headers, body
curl -I https://factorsphere.org # HEAD request only, no body
curl -L https://factorsphere.org # follow redirects
curl -X POST https://api.example.com/endpoint \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{"key": "value"}'
curl -o /dev/null -s -w "%{http_code}\n" https://factorsphere.org # just status code
Read every header in the curl -v output. Know what these mean:
- Cache-Control: max-age=86400 (browser can cache for 24 hours)
- X-Forwarded-For (original client IP when behind a proxy)
- Strict-Transport-Security (forces HTTPS)
- CF-RAY (Cloudflare request ID, useful for debugging Workers)
- Content-Encoding: gzip (response is compressed)
HTTP status codes you must know cold: 200, 201, 204, 301, 302, 304, 400, 401, 403, 404, 422, 429, 500, 502, 503, 504. Know the difference between 401 and 403, between 502 and 503.
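The two distinctions named above can be drilled with a short script. This is a study aid sketch, not roadmap code: describe, status_class, and the NOTES wording are illustrative choices, while HTTPStatus comes from Python's standard library.

```python
from http import HTTPStatus

# One-line reminders for the pairs that interviewers like to contrast
NOTES = {
    401: "no valid credentials were presented; authenticate and retry",
    403: "the request was understood, but this identity is not allowed",
    502: "a proxy or gateway got an invalid response from the upstream",
    503: "the server cannot handle the request right now (overload or maintenance)",
}


def describe(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '502 Bad Gateway'."""
    return f"{code} {HTTPStatus(code).phrase}"


def status_class(code: int) -> str:
    """Classify a status code by its first digit."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    return classes[code // 100]


for code, note in NOTES.items():
    print(f"{describe(code)} ({status_class(code)}): {note}")
```

In practice, 502 usually sends you to the upstream application or its socket, while 503 sends you to capacity, health checks, or a maintenance flag.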
TLS handshake: be able to describe client hello -> server hello + certificate -> client verifies cert -> key exchange -> symmetric session established. Not cryptography depth, but the sequence.
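The outcome of that sequence can be observed from Python's standard ssl module. The sketch below is a hedged illustration: make_context and tls_summary are hypothetical helper names, and the function needs network access when called, so it is defined here without being invoked.

```python
import socket
import ssl


def make_context() -> ssl.SSLContext:
    """Default client context: verifies the certificate chain and the hostname."""
    return ssl.create_default_context()


def tls_summary(hostname: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Complete a TLS handshake and report what was negotiated."""
    ctx = make_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        # wrap_socket performs the handshake: hello messages, certificate
        # verification, and key exchange all happen before it returns
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
            return {
                "protocol": tls.version(),    # negotiated TLS version string
                "cipher": tls.cipher()[0],    # negotiated cipher suite name
                "not_after": cert["notAfter"],
            }


# Example call (requires network access):
# print(tls_summary("factorsphere.org"))
```

If certificate verification fails, wrap_socket raises ssl.SSLCertVerificationError, which is the "client verifies cert" step rejecting the server.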
openssl s_client -connect factorsphere.org:443 # inspect the TLS certificate
openssl x509 -in cert.pem -noout -dates # check cert expiry
Thursday: SSH advanced
~/.ssh/config β create this file:
Host vps
HostName <your-vps-ip>
User deploy
IdentityFile ~/.ssh/id_ed25519
ServerAliveInterval 60
Host github.com
IdentityFile ~/.ssh/id_ed25519_github
User git
Now ssh vps instead of ssh -i ~/.ssh/id_ed25519 deploy@<ip>.
Local port forwarding: access a service on the VPS that isn't exposed publicly:
ssh -L 5432:localhost:5432 vps
# Now psql -h localhost -p 5432 connects to Postgres on the VPS
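Remote port forwarding: expose a local service through the VPS (useful for demos):
ssh -R 8080:localhost:3000 vps
# Connections to port 8080 on the VPS now forward to your local machine's port 3000
# By default sshd binds the remote port to the VPS loopback only; reaching it from
# outside needs GatewayPorts enabled in sshd_config and a UFW rule for 8080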
ProxyJump: hop through a bastion:
ssh -J bastion.example.com internal-server.example.com
# Or in config:
# Host internal
# ProxyJump bastion
SSH agent:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
ssh-add -l # list loaded keys
Friday-Saturday: Firewalls and network diagnostics
UFW advanced:
ufw status numbered # numbered rules for deletion
ufw delete 3 # delete rule 3
ufw allow from 10.0.0.0/8 to any port 5432 # Postgres from internal network only
ufw logging on
tail -f /var/log/ufw.log
iptables: you need to read it, not memorize it:
iptables -L -n -v # list all rules with packet/byte counts
iptables -L INPUT -n -v # just INPUT chain
iptables -t nat -L -n # NAT table
UFW uses iptables underneath. When UFW allows port 80, it adds an iptables ACCEPT rule. This is how to answer "how does UFW work under the hood?"
nftables:
nft list ruleset # view current rules
Network diagnostics:
ping -c 4 8.8.8.8 # basic reachability
traceroute 8.8.8.8 # hop-by-hop path
mtr 8.8.8.8 # traceroute + ping combined, live
ss -s # socket statistics summary
ss -tnp state established # established TCP connections
Sunday: Build the networking diagnostics script
This goes on GitHub as a real project.
endpoint-monitor.py (check a list of endpoints and report health):
#!/usr/bin/env python3
"""
Endpoint health monitor: checks DNS, HTTP reachability, and SSL cert expiry.
Usage: python3 endpoint-monitor.py --config endpoints.yaml [--json]
"""
import argparse, json, socket, ssl, datetime, sys, time
import urllib.request, urllib.error
from urllib.parse import urlparse
import yaml
def check_dns(hostname):
    try:
        ip = socket.gethostbyname(hostname)
        return {"status": "ok", "ip": ip}
    except socket.gaierror as e:
        return {"status": "error", "error": str(e)}
def check_http(url, timeout=10):
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "endpoint-monitor/1.0"})
        start = time.monotonic()
        with urllib.request.urlopen(req, timeout=timeout) as r:
            latency_ms = round((time.monotonic() - start) * 1000)
            return {"status": "ok", "http_code": r.status, "latency_ms": latency_ms}
    except urllib.error.HTTPError as e:
        return {"status": "error", "http_code": e.code}
    except Exception as e:
        return {"status": "error", "error": str(e)}
def check_ssl(hostname, port=443):
    try:
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(socket.create_connection((hostname, port), timeout=10),
                             server_hostname=hostname) as s:
            cert = s.getpeercert()
        expiry_str = cert['notAfter']
        # notAfter is expressed in GMT, so attach UTC and compare against an aware "now"
        expiry = datetime.datetime.strptime(expiry_str, "%b %d %H:%M:%S %Y %Z")
        expiry = expiry.replace(tzinfo=datetime.timezone.utc)
        days_left = (expiry - datetime.datetime.now(datetime.timezone.utc)).days
        return {"status": "ok", "expires": expiry_str, "days_remaining": days_left}
    except Exception as e:
        return {"status": "error", "error": str(e)}
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args()
    with open(args.config) as f:
        config = yaml.safe_load(f)
    results = {}
    for endpoint in config.get("endpoints", []):
        name = endpoint["name"]
        url = endpoint["url"]
        # urlparse drops the scheme, port, and path; a host:port string would fail the DNS check
        hostname = urlparse(url).hostname
        results[name] = {
            "dns": check_dns(hostname),
            "http": check_http(url),
            "ssl": check_ssl(hostname) if url.startswith("https") else None,
        }
    if args.json:
        print(json.dumps(results, indent=2))
    else:
        for name, checks in results.items():
            print(f"\n{name}:")
            for check_name, result in checks.items():
                if result:
                    status = "✓" if result["status"] == "ok" else "✗"
                    print(f"  {status} {check_name}: {result}")
    # Non-zero exit if any check failed, so cron jobs and CI can react
    any_failure = any(
        c["status"] == "error"
        for r in results.values()
        for c in r.values() if c
    )
    sys.exit(1 if any_failure else 0)
if __name__ == "__main__":
    main()
endpoints.yaml:
endpoints:
  - name: FactorSphere
    url: https://factorsphere.org
  - name: FactorSphere API
    url: https://api.factorsphere.org
  - name: VPS
    url: http://<your-vps-ip>:8080
Deliverables, Week 2:
- ops-scripts repo updated: endpoint-monitor.py added with endpoints.yaml.example and updated README
- Can verbally answer in an interview: "Walk me through what happens when a user types factorsphere.org and hits Enter", from DNS query through TLS through Cloudflare edge to Worker response
- VPS subdomain configured (if you have a domain) and verified with dig
Week 3 - Python DevOps Scripting
What this week unlocks: Python automation is the most common take-home task format in NCR DevOps interviews. By the end of this week you have a GitHub repo with real scripts and can complete a take-home assignment in 2 hours rather than 4.
Study hours: 42
Learning objectives:
- subprocess: running shell commands from Python, capturing output, handling return codes
- os/sys: environment variables, path operations, argument handling
- argparse: building proper CLI tools with flags and help text
- requests: HTTP calls, error handling, timeouts, retries
- yaml/json: config parsing, output generation
- Writing scripts that do real infrastructure work: deploy checks, API monitors, config validators, log parsers
Monday-Tuesday: subprocess + os + sys + argparse
subprocess, the right way:
import subprocess
# Run a command, capture output, check return code
result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    capture_output=True,
    text=True,
    timeout=10
)
print(result.stdout.strip()) # "active" or "inactive"
print(result.returncode) # 0 = active, 3 = inactive
# Run shell command (avoid when possible: harder to escape safely)
result = subprocess.run(
    "df -h | grep '/$'",
    shell=True, capture_output=True, text=True
)
# Raise on non-zero exit
subprocess.run(["docker", "ps"], check=True) # raises CalledProcessError if docker fails
os and sys:
import os, sys
# Environment variables
api_key = os.environ.get("CF_API_KEY") # returns None if not set, no KeyError
api_key = os.environ["CF_API_KEY"] # raises KeyError if not set; use when the value is required
# Paths
os.path.join("/var/log", "nginx", "access.log")
os.path.exists("/etc/nginx/nginx.conf")
os.path.abspath("../config")
# Script directory (useful for loading config files relative to script)
script_dir = os.path.dirname(os.path.abspath(__file__))
config_path = os.path.join(script_dir, "config.yaml")
# Exit with status code (important for shell scripts that call your Python)
sys.exit(0) # success
sys.exit(1) # failure
argparse (build a real CLI):
import argparse
def main():
    parser = argparse.ArgumentParser(
        description="Check service health on a VPS",
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("--host", required=True, help="VPS hostname or IP")
    parser.add_argument("--port", type=int, default=8080, help="Health endpoint port")
    parser.add_argument("--service", action="append", dest="services",
                        help="systemd service to check (repeat for multiple)")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--output-format", choices=["text", "json"], default="text")
    args = parser.parse_args()
    # args.host, args.port, args.services, args.verbose, args.output_format
Build deploy-check.py: it takes a service name as argument, checks it's active, verifies the port is listening, hits the health endpoint, and prints pass/fail with an exit code:
#!/usr/bin/env python3
"""
Post-deploy smoke test: checks systemd service, port, and HTTP health endpoint.
Usage: python3 deploy-check.py --service nginx --port 80 --endpoint /health
"""
import argparse, subprocess, socket, sys
import requests
def check_service(name):
    r = subprocess.run(["systemctl", "is-active", name],
                       capture_output=True, text=True)
    return r.stdout.strip() == "active"
def check_port(port, host="localhost"):
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:  # covers timeouts and refused connections
        return False
def check_http(url, timeout=10):
    try:
        r = requests.get(url, timeout=timeout)
        # Treat 4xx as failure too: a 404 on the health endpoint is not a healthy deploy
        return r.status_code < 400, r.status_code
    except requests.exceptions.RequestException as e:
        return False, str(e)
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--service", required=True)
    parser.add_argument("--port", type=int, required=True)
    parser.add_argument("--endpoint", default="/health")
    parser.add_argument("--host", default="localhost")
    args = parser.parse_args()
    checks = [
        ("service active", check_service(args.service)),
        ("port listening", check_port(args.port, args.host)),
    ]
    http_ok, http_detail = check_http(f"http://{args.host}:{args.port}{args.endpoint}")
    # Include the status code or error text so a FAIL line explains itself
    checks.append((f"HTTP {args.endpoint} ({http_detail})", http_ok))
    passed = all(ok for _, ok in checks)
    for name, ok in checks:
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {name}")
    sys.exit(0 if passed else 1)
if __name__ == "__main__":
    main()
Wednesday: requests + API interaction
import os
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# Basic
r = requests.get("https://api.example.com/status", timeout=10)
r.raise_for_status() # raises HTTPError for 4xx/5xx
data = r.json()
# With headers
r = requests.get(
    "https://api.cloudflare.com/client/v4/accounts",
    headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
    timeout=10
)
# Retry logic
session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
r = session.get("https://api.example.com", timeout=10)
Build cloudflare-deploy-monitor.py: it checks the status of the most recent FactorSphere Workers deployment via the Cloudflare API. Reads CF_API_TOKEN and CF_ACCOUNT_ID from environment variables. Outputs the deployment status and timestamp. Returns exit code 1 if the last deployment failed. This is a real script that interacts with a real production system.
Cloudflare API endpoint to use: GET /accounts/{account_id}/workers/scripts/{script_name}/deployments (check current CF API docs for exact path; the concept is what matters here, and the implementation requires your actual credentials).
Thursday: YAML/JSON parsing
import yaml, json
# YAML
with open("config.yaml") as f:
    config = yaml.safe_load(f) # safe_load, not load: avoids arbitrary code execution
# JSON
with open("output.json") as f:
    data = json.load(f)
# Pretty print
print(json.dumps(data, indent=2, default=str)) # default=str handles datetime objects
# Write YAML
with open("generated-config.yaml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)
Build config-validator.py: it reads a YAML deployment config, validates that required keys exist and have the right types, outputs errors if invalid, and exits 0 if valid:
REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}
This is a real pattern: CI pipelines often validate config before proceeding.
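One way the validation step could look is sketched below. It is a starting point under stated assumptions, not the finished config-validator.py: validate_config is a hypothetical function name, and the config is passed in as a plain dict (in the real script it would come from yaml.safe_load).

```python
REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}


def validate_config(config) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    if not isinstance(config, dict):
        return ["top-level config must be a mapping"]
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


example = {"service_name": "api", "image": "myapi:latest", "port": "5000",
           "health_endpoint": "/health", "environment": {}}
for problem in validate_config(example):
    print(problem)
```

Collecting every error before exiting, instead of stopping at the first one, means a CI log shows the whole list in a single run; the script would then call sys.exit(1) when the list is non-empty.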
Friday-Saturday: Full ops script
Build vps-health-reporter.py. This is the Week 3 anchor deliverable. It's a single script that does everything:
Usage: python3 vps-health-reporter.py [--json] [--verbose] [--output FILE]
Checks:
- Disk usage per filesystem (warns if > configurable threshold)
- Memory usage
- systemd services (configured list)
- Port reachability (configured list)
- HTTP endpoints (configured list with expected status codes)
- SSL cert expiry for HTTPS endpoints (warns if < 30 days)
Output:
- Text table by default
- JSON with --json flag
- Writes to file with --output flag
- Exit code 0 if all checks pass, 1 if any fail
Config: reads from vps-health-reporter.yaml:
services:
  - nginx
  - healthapi
  - docker
ports:
  - host: localhost
    port: 80
    name: nginx-http
  - host: localhost
    port: 8080
    name: healthapi
endpoints:
  - url: http://localhost:8080
    expected_status: 200
    name: healthapi-root
  - url: https://factorsphere.org
    expected_status: 200
    name: factorsphere
disk_threshold_pct: 80
memory_threshold_pct: 90
ssl_warning_days: 30
This script uses subprocess, requests, ssl, socket, yaml, json, argparse, sys, os. It solves a real problem: paste it into any VPS and get a health report.
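As a starting point for the threshold logic, here is a hedged sketch of the disk check and the exit-code rule. The names usage_check, disk_check, and exit_code are hypothetical, and the sketch uses shutil.disk_usage from the standard library instead of parsing df output, which is a substitution rather than what the spec above requires.

```python
import shutil


def usage_check(name: str, used: float, total: float, threshold_pct: float) -> dict:
    """Compare a used/total pair against a percentage threshold."""
    pct = used * 100 / total
    return {"name": name, "used_pct": round(pct, 1), "ok": pct <= threshold_pct}


def disk_check(path: str = "/", threshold_pct: float = 80) -> dict:
    """Check one filesystem; threshold would come from disk_threshold_pct in the YAML."""
    usage = shutil.disk_usage(path)
    return usage_check(f"disk {path}", usage.used, usage.total, threshold_pct)


def exit_code(results: list[dict]) -> int:
    """0 if every check passed, 1 if any failed (matches the Output rules above)."""
    return 0 if all(r["ok"] for r in results) else 1


results = [disk_check("/", threshold_pct=80)]
for r in results:
    print(f"[{'PASS' if r['ok'] else 'WARN'}] {r['name']}: {r['used_pct']}%")
```

Keeping every check in the same name/ok dict shape makes the text table, the --json output, and the final exit code all derive from one list.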
Sunday: Polish and documentation
requirements.txt:
requests==2.31.0
PyYAML==6.0.1
Meaningful commit history: if all your commits are "add scripts", you're doing it wrong. Examples of correct commit messages:
feat(deploy-check): add HTTP health endpoint validation
fix(endpoint-monitor): handle SSL cert expiry for non-HTTPS endpoints gracefully
feat(vps-reporter): add configurable disk/memory thresholds from YAML config
refactor(cloudflare-monitor): extract API client to reusable class
README for the repo: what problem each script solves, prerequisites, installation (pip install -r requirements.txt), example usage for each script, example output.
Deliverables, Week 3:
- ops-scripts repo: 5+ scripts (health-check.sh, backup.sh, log-analyzer.sh, endpoint-monitor.py, deploy-check.py, cloudflare-deploy-monitor.py, config-validator.py, vps-health-reporter.py), requirements.txt, endpoints.yaml.example, vps-health-reporter.yaml.example, clean README, 20+ meaningful commits
Week 4 - Docker (Production) + Docker Compose + Application Strategy
What this week unlocks: Docker and Compose are tested in almost every junior DevOps interview in NCR. By the end of this week you have a real multi-service stack running on the VPS, something you can show and explain in a technical round. Sunday is the application strategy session.
Study hours: 42
Learning objectives:
- Multi-stage Dockerfiles: how and why, not just what
- Image optimization: layer caching order, .dockerignore, minimal base images
- Container networking: how containers resolve each other by name in Compose
- Volume management: named volumes vs bind mounts, when each is appropriate
- Health checks: HEALTHCHECK instruction, container health states, depends_on conditions
- Running Docker Compose stacks as persistent services on VPS
- Docker Compose: full multi-service stack, .env files, override files, Makefile
Monday-Tuesday: Production Dockerfile patterns
You've used Docker. These are the patterns that distinguish junior from intern-level usage.
Multi-stage build for a Python app:
# syntax=docker/dockerfile:1
# -- Stage 1: builder --
FROM python:3.11-slim AS builder
WORKDIR /build
# Copy only requirements first so Docker can cache this layer
# If requirements.txt doesn't change, this layer is reused on rebuild
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# -- Stage 2: runtime --
FROM python:3.11-slim AS runtime
# Create non-root user
RUN useradd --create-home --shell /bin/bash appuser
WORKDIR /app
# Copy only the installed packages from builder
COPY --from=builder /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .
USER appuser
# PATH must include user-installed packages
ENV PATH=/home/appuser/.local/bin:$PATH
EXPOSE 5000
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1
CMD ["python", "-m", "gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "app:app"]
.dockerignore:
.git
.gitignore
__pycache__
*.pyc
*.pyo
.pytest_cache
.env
*.env
node_modules
.venv
dist
build
*.egg-info
README.md
docs/
Why layer caching order matters: COPY requirements.txt . + RUN pip install before COPY . . means requirements are cached as long as requirements.txt doesn't change. If you COPY . . first, every file change invalidates the pip install layer. Run docker build twice: the second run should show CACHED for the pip layer.
Compare image sizes:
docker images | grep myapp
# naive (python:3.11): ~1.1GB
# multi-stage (python:3.11-slim): ~200MB
# distroless: ~100MB
Container networking:
docker network create mynet
docker run -d --name db --network mynet postgres:15
docker run -it --network mynet alpine ping db # resolves by container name
docker inspect mynet # see connected containers and IP assignments
Wednesday: Docker Compose
Build a docker-compose.yml for a real multi-service stack:
version: "3.9"
services:
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    image: myapi:latest
    container_name: myapi
    ports:
      - "5000:5000"
    environment:
      - DATABASE_URL=postgresql://postgres:${POSTGRES_PASSWORD}@db:5432/appdb
      - REDIS_URL=redis://cache:6379/0
      - SECRET_KEY=${SECRET_KEY}
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped
    networks:
      - backend
  db:
    image: postgres:15-alpine
    container_name: mydb
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - backend
  cache:
    image: redis:7-alpine
    container_name: myredis
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    restart: unless-stopped
    networks:
      - backend
volumes:
  postgres_data:
  redis_data:
networks:
  backend:
    driver: bridge
.env (never commit this; commit .env.example instead):
POSTGRES_PASSWORD=changeme_in_production
SECRET_KEY=changeme_in_production
Override for development: docker-compose.override.yml (Compose loads it automatically whenever it sits next to docker-compose.yml and no -f flag is given):
services:
api:
volumes:
- ./api:/app # bind mount for hot reload in dev
environment:
- DEBUG=true
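Make sure production does not apply the override. Because Compose picks the file up automatically, a plain docker compose up -d in a full clone of the repo would carry the bind mount and DEBUG=true into prod. Either keep the file out of the deployed checkout, or run docker compose -f docker-compose.yml up -d on the server (and in the systemd unit and CI deploy step), which skips the automatic override.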
Thursday: Containers as systemd services + resource limits
Running Docker Compose as a systemd service on the VPS:
/etc/systemd/system/myapp.service:
[Unit]
Description=MyApp Docker Compose Stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=120
TimeoutStopSec=30
User=deploy
[Install]
WantedBy=multi-user.target
Resource limits in Compose:
services:
api:
deploy:
resources:
limits:
cpus: "0.5"
memory: 256M
reservations:
memory: 128M
Log management (prevent containers from filling your disk):
services:
api:
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
Friday-Saturday: Build the anchor project: vps-multiservice
This is the Week 4 portfolio project. Build it properly.
Project: Multi-service API stack on VPS
Repository: vps-multiservice
Structure:
vps-multiservice/
├── api/
│   ├── app.py
│   ├── Dockerfile
│   └── requirements.txt
├── db/
│   └── init.sql
├── nginx/                      # placeholder config; Week 5 replaces this
│   └── default.conf
├── docker-compose.yml
├── docker-compose.override.yml
├── .env.example
├── Makefile
└── README.md
api/app.py (a real Flask API, not hello-world):
import os
import time

import psycopg2
import redis
from flask import Flask, jsonify

app = Flask(__name__)
START_TIME = time.time()  # recorded once at import so /info can report real uptime


def get_db():
    return psycopg2.connect(os.environ["DATABASE_URL"])


def get_redis():
    url = os.environ.get("REDIS_URL", "redis://cache:6379/0")
    return redis.from_url(url)


@app.route("/health")
def health():
    checks = {}
    # Check Postgres
    try:
        conn = get_db()
        conn.close()
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
    # Check Redis
    try:
        r = get_redis()
        r.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {e}"
    all_ok = all(v == "ok" for v in checks.values())
    status = "ok" if all_ok else "degraded"
    return jsonify({"status": status, "checks": checks}), 200 if all_ok else 503


@app.route("/info")
def info():
    return jsonify({
        "service": "vps-multiservice-api",
        "version": os.environ.get("APP_VERSION", "dev"),
        # seconds since the process started (time.time() alone is just the current epoch)
        "uptime": round(time.time() - START_TIME, 1),
    })


@app.route("/cache/set/<key>/<value>")
def cache_set(key, value):
    r = get_redis()
    r.setex(key, 300, value)  # expire after 300 seconds
    return jsonify({"stored": key})


@app.route("/cache/get/<key>")
def cache_get(key):
    r = get_redis()
    value = r.get(key)
    return jsonify({"key": key, "value": value.decode() if value else None})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Makefile:
.PHONY: up down logs ps build restart health shell-api shell-db deploy
up:
docker compose up -d --build
down:
docker compose down
logs:
docker compose logs -f
ps:
docker compose ps
build:
docker compose build --no-cache
restart:
docker compose restart
health:
curl -s http://localhost:5000/health | python3 -m json.tool
shell-api:
docker compose exec api /bin/bash
shell-db:
docker compose exec db psql -U postgres -d appdb
README.md must answer: what this project is, what services it runs, how to start it locally, how to deploy to production, what the health endpoint checks, what environment variables are required, what the architecture looks like (diagram), and what decisions were made (why named volumes over bind mounts, why service_healthy condition on depends_on, why non-root user in Dockerfile).
Deploy to the Hetzner VPS: clone the repo, create .env from .env.example, make up, verify make health returns 200. Configure the systemd service so it starts on boot.
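Containers can take a few seconds to pass their health checks after make up, so a single curl may report a false failure. A minimal sketch of a polling check follows. It assumes the /health response shape from app.py above; the function names, attempt count and delay are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url, timeout=5):
    """Return (status_code, parsed_json_body) for a health URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read().decode())
    except urllib.error.HTTPError as err:
        # A 503 from the app still carries the JSON body with per-check detail
        return err.code, json.loads(err.read().decode())


def wait_until_healthy(fetch, attempts=10, delay=3.0, sleep=time.sleep):
    """Poll fetch() until it reports HTTP 200 with status 'ok'."""
    for attempt in range(1, attempts + 1):
        try:
            code, body = fetch()
            if code == 200 and body.get("status") == "ok":
                return True
            print(f"attempt {attempt}: HTTP {code}, checks={body.get('checks')}")
        except (OSError, ValueError) as err:
            # Connection refused, timeout, or a non-JSON body during startup
            print(f"attempt {attempt}: {err}")
        if attempt < attempts:
            sleep(delay)
    return False


# Example call on the VPS (not executed here):
# ok = wait_until_healthy(lambda: fetch_health("http://localhost:5000/health"))
```

Passing the fetch function in as an argument keeps the retry loop testable without a running stack.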
Sunday: Application strategy
This is the only Sunday in Phase 1 not dedicated to pure technical work. Block the full day.
Resume β one page, PDF:
Header: your name, email, GitHub URL, LinkedIn URL, location (Delhi NCR), phone.
Title: Junior DevOps & Infrastructure Engineer
Experience section:
Software Developer Intern, [Antler-backed venture studio] [dates]
• Shipped production SaaS MVPs across multiple projects; owned Docker-based
  deployment pipelines and CI/CD configuration for 3+ products
• Implemented GitHub Actions workflows for automated testing and deployment
• Worked across time zones with distributed team
Projects section (this is more important than the internship for a DevOps role):
FactorSphere (factorsphere.org): Production Edge Platform
• Live academic journal ranking platform with 4,000+ journals, real users,
  Google-indexed; won college project exhibition
• Serverless microservices on Cloudflare Workers (analogous to AWS Lambda +
  API Gateway), Pinecone vector database, LLM inference pipeline
• CI/CD: GitHub Actions → Wrangler → Cloudflare edge deployment
• Architecture: fully edge-native; documented in ARCHITECTURE.md on GitHub
VPS Multi-Service Stack (github.com/you/vps-multiservice)
• Multi-service API stack on Ubuntu VPS: Flask + Redis + PostgreSQL
• Multi-stage Docker builds, Docker Compose, named volumes, health checks
• systemd service management, UFW firewall configuration, SSH hardening
Skills section:
Infrastructure: Linux (Ubuntu/Arch), Docker, Docker Compose, systemd, UFW, SSH
Scripting: Python (subprocess, requests, argparse, YAML/JSON), Bash
CI/CD: GitHub Actions, Wrangler (Cloudflare)
Networking: DNS, HTTP/S, TCP/IP, SSL/TLS, reverse proxy concepts
Observability: log analysis (journalctl), endpoint monitoring
Platforms: Cloudflare Workers/Pages, Pinecone, Hetzner VPS
Version Control: Git (branching, rebasing, structured commits)
Do not list technologies you can't defend in a 5-minute conversation. If Kubernetes is not on your CV, a recruiter won't ask about it. If it is, they will.
LinkedIn:
- Headline: Junior DevOps Engineer | Cloudflare Workers | Docker | Python | Delhi NCR
- About section: 3 sentences. "CS grad with production experience at an Antler-backed venture studio. Built and deployed FactorSphere (factorsphere.org), a live platform running on a serverless edge architecture. Targeting junior DevOps and cloud infrastructure roles in Delhi NCR."
- Add all projects with links
- Enable "Open to Work" (visible to recruiters, not your network if you prefer)
Job titles to search:
- "DevOps Engineer" + fresher/junior/0-2 years
- "Cloud Support Engineer"
- "Infrastructure Engineer"
- "Site Reliability Engineer" (rare at fresher level but exists)
- "Systems Engineer" (often infrastructure work at service companies)
Platforms, in priority order:
- LinkedIn: set job alerts for each title + Delhi, Gurgaon, Noida
- Naukri.com: mandatory for NCR; service companies and GCCs post exclusively here
- Instahyre: funded product companies
- Wellfound (AngelList): funded startups
- Company career pages directly: Nagarro, GlobalLogic, Publicis Sapient, Genpact, NIIT Technologies, Mphasis
Outreach message template (LinkedIn, max 5 lines):
Hi [Name], I'm a CS grad with production experience shipping SaaS MVPs at an Antler-backed venture studio, including a live platform (factorsphere.org) running on a serverless edge architecture with Docker, CI/CD, and Python scripting across projects. I'm targeting junior DevOps roles in NCR and noticed [Company] works on [relevant tech or cloud platform from their JD]. Would it be appropriate to share my CV directly?
Send this to: DevOps leads, engineering managers, or HR at target companies. Not recruiters at staffing agencies (waste of time for this profile). Target 10 outreach messages in the first week of applying.
By Sunday evening, completed:
- Resume PDF finalized
- LinkedIn updated
- Naukri profile created with correct resume
- 5 job applications submitted
- 5 outreach messages sent on LinkedIn
Deliverables, Week 4:
- vps-multiservice repo: running on VPS, full README, Makefile, .env.example, meaningful commit history showing incremental development
- ops-scripts repo: polished with all Week 1-3 scripts, clean README
- endpoint-monitor.py and vps-health-reporter.py running and documented
- FactorSphere: docs/ARCHITECTURE.md and docs/CICD.md committed
- Resume PDF (one page)
- LinkedIn updated
- 5+ applications submitted, 5+ outreach messages sent
- Job alert set on LinkedIn and Naukri
PHASE 2 (Weeks 5-12): Become Hireable by Stronger Companies
Week 5: Nginx
What this week unlocks: Reverse proxy knowledge is expected at junior level. SSL termination and upstream configuration come up in almost every DevOps technical round. You also get HTTPS on your VPS project, a visible signal of operational maturity.
Study hours: 42
Learning objectives: Virtual hosts, reverse proxy, SSL termination with Let's Encrypt, load balancing upstream blocks, rate limiting, security headers, static file serving
Monday: Installation and configuration structure
apt install nginx
systemctl enable nginx
systemctl start nginx
nginx -v
Configuration structure on Ubuntu:
- /etc/nginx/nginx.conf: main config, defines worker_processes, events, and the http block
- /etc/nginx/sites-available/: your server blocks live here
- /etc/nginx/sites-enabled/: symlinks to sites-available for active configs
- nginx -t: test config syntax; always run before systemctl reload nginx
- systemctl reload nginx: graceful reload (no dropped connections) vs restart (all connections dropped)
Read /etc/nginx/nginx.conf. Understand: worker_processes auto uses all cores, worker_connections 1024 limits connections per worker, the include /etc/nginx/sites-enabled/* line.
Tuesday-Wednesday: Reverse proxy + SSL
Create /etc/nginx/sites-available/myapp:
server {
listen 80;
server_name vps.yourdomain.com;
location / {
proxy_pass http://127.0.0.1:5000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_connect_timeout 30s;
proxy_read_timeout 30s;
proxy_buffering on;
}
location /health {
proxy_pass http://127.0.0.1:5000/health;
access_log off; # don't pollute logs with health checks
}
}
ln -s /etc/nginx/sites-available/myapp /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx
SSL with Certbot:
apt install certbot python3-certbot-nginx
certbot --nginx -d vps.yourdomain.com
# Follow prompts: enter email, agree to TOS, choose redirect HTTP→HTTPS
certbot renew --dry-run # verify auto-renewal works
After Certbot runs, inspect /etc/nginx/sites-available/myapp, because Certbot modifies it. The config now has a listen 443 ssl block and an HTTP-to-HTTPS redirect. Read and understand what was added.
Auto-renewal is handled by a systemd timer installed by Certbot: systemctl status certbot.timer.
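A renewal timer can fail silently, so it is worth checking the remaining lifetime of the certificate independently. The sketch below converts a certificate's notAfter string (the format returned by Python's SSLSocket.getpeercert()) into days remaining. The function names and the 20-day threshold are assumptions for illustration, not Certbot settings.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter timestamp.

    not_after uses the getpeercert() format, e.g. "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)


def needs_attention(not_after, threshold_days=20, now=None):
    """True when fewer than threshold_days remain, suggesting renewal is not running."""
    return days_until_expiry(not_after, now) < threshold_days
```

This kind of check fits naturally into endpoint-monitor.py as an extra field per monitored HTTPS endpoint.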
Thursday: Multiple virtual hosts + static files
Three server blocks:
- docs.yourdomain.com: serves static HTML files from /var/www/docs/
- api.yourdomain.com: reverse proxy to Flask API on port 5000
- Default catch-all: returns 444 (nginx closes connection without response)
Static file serving:
server {
listen 443 ssl;
server_name docs.yourdomain.com;
# (SSL certs added by certbot)
root /var/www/docs;
index index.html;
location / {
try_files $uri $uri/ =404;
}
# Cache static assets
location ~* \.(css|js|png|jpg|ico)$ {
expires 30d;
add_header Cache-Control "public, immutable";
}
}
Default catch-all:
server {
listen 80 default_server;
listen 443 ssl default_server;
server_name _;
ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
return 444;
}
Friday-Saturday: Load balancing + rate limiting + security headers
Upstream block for load balancing:
upstream api_backends {
least_conn; # route to server with fewest active connections
server 127.0.0.1:5000 weight=2;
server 127.0.0.1:5001 weight=1; # start a second Flask instance for this exercise
keepalive 32;
}
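server {
    location / {
        proxy_pass http://api_backends;
        # upstream keepalive only takes effect with HTTP/1.1 and a cleared Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}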
Rate limiting:
# In http block (nginx.conf or a conf.d include):
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=health_limit:1m rate=60r/m;
# In server block:
location /api/ {
limit_req zone=api_limit burst=20 nodelay;
proxy_pass http://api_backends;
}
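To get a feel for what rate=10r/s with burst=20 nodelay does, here is a simplified single-client model of the leaky-bucket accounting. It is a sketch written for this roadmap, not nginx source code; the class name and the timestamps are illustrative.

```python
class LimitReqModel:
    """Simplified model of limit_req with burst and nodelay for one client."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = None  # requests currently counted against the bucket
        self.last = None    # time of the last accepted request

    def allow(self, now):
        if self.excess is None:  # first request seen from this client
            self.excess, self.last = 0.0, now
            return True
        # Drain the bucket at the configured rate, then add this request
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            return False  # nginx rejects with 503 unless limit_req_status is set
        self.excess, self.last = excess, now
        return True


limiter = LimitReqModel(rate_per_sec=10, burst=20)
instant = sum(limiter.allow(0.0) for _ in range(30))           # 30 requests at t=0
after_one_second = sum(limiter.allow(1.0) for _ in range(30))  # 30 more at t=1s
print(f"accepted at t=0: {instant}, accepted at t=1s: {after_one_second}")
```

In this model the first request plus the full burst are accepted immediately, and after that the client only regains capacity at the configured rate. With nodelay the accepted burst requests are forwarded at once instead of being queued.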
Security headers:
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
Sunday: Update vps-multiservice
Replace the placeholder Nginx container in Docker Compose with a reference to the host Nginx. Update README.md:
- Architecture diagram showing: internet → Nginx (host, SSL) → Docker network → Flask API container
- SSL configuration documented
- How to reproduce (Certbot commands, nginx site config)
Update the vps-multiservice README with a "Production Deployment" section showing the full stack.
Week 6: GitHub Actions for VPS
What this week unlocks: CI/CD for VPS-hosted projects is the most visible resume signal. This pipeline demonstrates that your Docker Compose stack is managed like production infrastructure, not run manually. Combined with FactorSphere's existing CI/CD, you can now speak to two different CI/CD patterns.
Study hours: 42
Monday-Tuesday: GitHub Actions syntax
Create .github/workflows/deploy.yml in vps-multiservice:
name: Deploy to VPS
on:
push:
branches: [main]
workflow_dispatch: # allow manual trigger from GitHub UI
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Validate docker-compose
run: docker compose config
deploy:
needs: test
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Deploy to VPS
uses: appleboy/ssh-action@v1.0.0
with:
host: ${{ secrets.VPS_HOST }}
username: deploy
key: ${{ secrets.VPS_SSH_KEY }}
script: |
cd /opt/myapp
git pull origin main
docker compose pull
docker compose up -d --build --remove-orphans
docker system prune -f
- name: Smoke test
run: |
sleep 10
curl -f https://vps.yourdomain.com/health || exit 1
GitHub Secrets to configure (Settings → Secrets and Variables → Actions):
- VPS_HOST: your VPS IP
- VPS_SSH_KEY: contents of a dedicated deploy private key (generate a new ed25519 key specifically for GitHub Actions, add the public key to ~/.ssh/authorized_keys on the VPS)
Never put the private key in the repo or hardcode it in the workflow.
Wednesday-Thursday: Zero-downtime consideration + notifications
The naive docker compose up -d --build restarts containers, causing brief downtime. For a portfolio project this is acceptable. Document this limitation in the README and explain what a production mitigation looks like (blue-green deployment, rolling update in Kubernetes, or a health-check grace period with a load balancer).
Add failure notification via GitHub's built-in email (no setup required; GitHub emails you by default when a workflow fails).
Add a status badge to the README:

Friday-Saturday: Makefile deployment tooling
Add to the Makefile:
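-include .env.make   # optional, uncommitted file that may define VPS_HOST
deploy:
	@echo "Deploying to VPS..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git pull && docker compose up -d --build"
rollback:
	@echo "Rolling back to the previous commit..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git checkout HEAD~1 && docker compose up -d --build"
status:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose ps && curl -s http://localhost:5000/health"
logs-prod:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose logs -f --tail=100"
Usage: make VPS_HOST=<ip> deploy, or set VPS_HOST in a local .env.make (not committed). Two details matter here. The rollback target needs --build, otherwise Compose reuses the myapi:latest image built from the newer commit. It also leaves the server checkout on a detached HEAD, so run git checkout main there before the next deploy. Every remote command starts with cd /opt/myapp because docker compose looks for the compose file in the current directory.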
Sunday: FactorSphere CI/CD comparison
Update docs/CICD.md in FactorSphere to add a comparison section:
## CI/CD Pattern Comparison
FactorSphere (Cloudflare Workers):
  Trigger: push to main
  Pipeline: GitHub Actions → wrangler deploy → Cloudflare edge network
  Rollback: wrangler rollback <deployment-id> or git revert + push
  State: stateless Workers, no server to manage
vps-multiservice (Docker Compose + SSH):
  Trigger: push to main
  Pipeline: GitHub Actions → SSH → git pull → docker compose up
  Rollback: git checkout HEAD~1 + docker compose up --build (or tag-based rollback)
  State: stateful services (Postgres data in named volume), must manage carefully
Being able to explain this comparison (why each pattern exists, what tradeoffs each makes) is a strong interview signal.
Weeks 7-9: AWS (EC2, IAM, S3, VPC, CloudWatch)
Three weeks of AWS. You are building toward a single coherent project: the same multi-service stack, now running on EC2, with IAM roles, S3 backups, VPC networking, and CloudWatch logging.
Week 7: EC2 + IAM
What this week unlocks: AWS is on your CV with real hands-on evidence. Most NCR product companies and GCCs require at least basic AWS. After this week you can pass a cloud support phone screen.
Learning objectives: EC2 launch and management, SSH key pairs, security groups, IAM users/roles/policies, principle of least privilege, AWS CLI, instance profiles
Monday-Tuesday: AWS account + IAM baseline
Create an AWS account (personal email, not institutional). Immediately:
- Enable MFA on the root account
- Create a billing alarm: first enable billing alerts under Billing preferences, then switch the console to us-east-1 (N. Virginia), the only region where billing metrics are published. CloudWatch → Alarms → Create Alarm → Billing → Total Estimated Charge → threshold $5 → email notification. This protects against accidentally leaving expensive resources running.
- Create an IAM user for yourself: yourname-admin, AdministratorAccess policy, programmatic + console access, enable MFA
- Never use root again
Configure AWS CLI:
aws configure
# AWS Access Key ID: [from IAM user]
# AWS Secret Access Key: [from IAM user]
# Default region: ap-south-1 (Mumbai, lowest latency from Delhi)
# Default output format: json
aws sts get-caller-identity # verify who you're authenticated as
aws iam get-user # verify correct user
IAM β understand these three things cold:
- User: a person or application with long-term credentials (access keys)
- Role: an identity assumed temporarily; no long-term credentials; assumed by EC2 instances, Lambda, other services
- Policy: a JSON document that defines what actions are allowed on which resources
Create a principle of least privilege policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::my-specific-bucket",
"arn:aws:s3:::my-specific-bucket/*"
]
}
]
}
This is more secure than "Action": "s3:*" and more secure than "Resource": "*". Be ready to explain why.
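One way to rehearse the explanation is to evaluate both policies against the same requests. The sketch below is a deliberately simplified matcher: it ignores explicit Deny statements, conditions, and the distinction between bucket-level and object-level actions that real IAM evaluation applies. The is_allowed helper and the BROAD policy are hypothetical and exist only for comparison.

```python
from fnmatch import fnmatchcase

SCOPED = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-specific-bucket",
            "arn:aws:s3:::my-specific-bucket/*",
        ],
    }],
}

# Hypothetical over-permissive policy, shown only for contrast
BROAD = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """True if any Allow statement matches; anything unmatched is denied."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


requests = [
    ("s3:GetObject", "arn:aws:s3:::my-specific-bucket/report.csv"),
    ("s3:DeleteObject", "arn:aws:s3:::my-specific-bucket/report.csv"),
    ("s3:GetObject", "arn:aws:s3:::someone-elses-bucket/secrets.txt"),
]
for action, resource in requests:
    print(action, resource,
          "scoped:", is_allowed(SCOPED, action, resource),
          "broad:", is_allowed(BROAD, action, resource))
```

The scoped policy permits only the first request; the broad one permits all three, including deleting objects and reading another bucket.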
Wednesday-Thursday: EC2 launch and management
Launch a t2.micro (free tier) with Ubuntu 22.04 in ap-south-1:
- AMI: Ubuntu 22.04 LTS (search "ubuntu 22.04 hvm" in Community AMIs or use the Quick Start)
- Instance type: t2.micro (free tier eligible)
- Key pair: create a new key pair named ec2-deploy-key, download the .pem
- Security group: create web-sg
  - Inbound: SSH (22) from your IP only (not 0.0.0.0/0), HTTP (80) from 0.0.0.0/0, HTTPS (443) from 0.0.0.0/0
  - Outbound: all traffic (default)
- Storage: 8GB gp3 (free tier)
SSH in:
chmod 400 ec2-deploy-key.pem
ssh -i ec2-deploy-key.pem ubuntu@<ec2-public-ip>
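Install Docker on the EC2 (you are logged in as ubuntu, so use sudo):
sudo apt update
# docker-compose-plugin ships only in Docker's own apt repository; on stock Ubuntu use docker-compose-v2 from universe, or add Docker's repo first
sudo apt install -y docker.io docker-compose-v2
sudo usermod -aG docker ubuntu   # log out and back in for the group change to apply
sudo systemctl enable docker
sudo systemctl start docker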
User data bootstrap: relaunch with a user data script that installs Docker automatically. In the EC2 launch wizard, under "Advanced Details" → "User data" (the script runs as root on first boot, so no sudo is needed):
#!/bin/bash
apt-get update -y
# docker-compose-v2 is the Compose package in Ubuntu's own repositories; docker-compose-plugin needs Docker's apt repo
apt-get install -y docker.io docker-compose-v2 git
usermod -aG docker ubuntu
systemctl enable docker
systemctl start docker
This is how you automate instance bootstrap, an important concept for EC2.
Friday-Saturday: IAM role + instance profile
Create an IAM role for EC2 to access S3:
- IAM β Roles β Create role
- Trusted entity: AWS service β EC2
- Policy: create inline policy:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::your-backup-bucket",
"arn:aws:s3:::your-backup-bucket/*"
]
}]
}
- Name the role ec2-app-role
- Attach the role to your EC2 instance: EC2 → Actions → Security → Modify IAM role
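Now on the EC2, without any stored credentials (install the AWS CLI first if the AMI lacks it, for example with sudo snap install aws-cli --classic):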
aws s3 ls s3://your-backup-bucket # works because instance has IAM role
aws sts get-caller-identity # shows the assumed-role identity
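This is the correct way to give EC2 access to AWS services, rather than putting access keys on the instance. Be ready to explain why: if the instance is compromised, long-term credentials in environment variables are exfiltrated and stay valid until someone rotates them; role credentials are temporary, rotated automatically, and scoped to the role's policy.
Deploy vps-multiservice to EC2: clone the repo, create .env, docker compose up -d. Verify from the instance that curl http://localhost:5000/health returns 200. The web-sg group does not open port 5000, so a request to http://<ec2-ip>:5000 from your laptop will time out unless you temporarily allow 5000 from your IP.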
Week 8: S3 + VPC
Learning objectives: S3 bucket operations, IAM policies for S3, static hosting, lifecycle policies, presigned URLs, boto3; VPC fundamentals (subnets, route tables, internet gateway, security groups vs NACLs)
Monday-Tuesday: S3
# Create bucket
aws s3 mb s3://your-backup-bucket-$(date +%s) --region ap-south-1
# Upload and download
aws s3 cp backup.tar.gz s3://your-backup-bucket/backups/backup.tar.gz
aws s3 sync ./logs/ s3://your-backup-bucket/logs/
aws s3 ls s3://your-backup-bucket/ --recursive
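# Static website hosting
aws s3 mb s3://your-docs-site
aws s3 website s3://your-docs-site --index-document index.html --error-document 404.html
# New buckets block public access and disable ACLs by default, so --acl public-read is rejected.
# Allow a public bucket policy, then attach one (policy.json grants s3:GetObject on arn:aws:s3:::your-docs-site/* to everyone).
aws s3api put-public-access-block --bucket your-docs-site --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=false,RestrictPublicBuckets=false
aws s3api put-bucket-policy --bucket your-docs-site --policy file://policy.json
aws s3 sync ./docs/ s3://your-docs-site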
Lifecycle policy (JSON):
{
"Rules": [{
"ID": "archive-old-backups",
"Status": "Enabled",
"Filter": {"Prefix": "backups/"},
"Transitions": [{
"Days": 30,
"StorageClass": "GLACIER"
}],
"Expiration": {"Days": 365}
}]
}
aws s3api put-bucket-lifecycle-configuration \
--bucket your-backup-bucket \
--lifecycle-configuration file://lifecycle.json
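To sanity-check what the lifecycle rule above implies for a given backup, the thresholds can be expressed as a small function. This is only a model of the rule's intent: S3 applies lifecycle actions asynchronously, so real transitions lag the exact day boundary. The function name and return labels are illustrative.

```python
from datetime import date


def lifecycle_state(created, today, transition_days=30, expiration_days=365):
    """Storage state the rule implies for an object of a given age."""
    age = (today - created).days
    if age >= expiration_days:
        return "EXPIRED"
    if age >= transition_days:
        return "GLACIER"
    return "STANDARD"


created = date(2025, 1, 1)
for check_date in (date(2025, 1, 15), date(2025, 3, 1), date(2026, 1, 1)):
    print(check_date, lifecycle_state(created, check_date))
```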
Presigned URLs with Python (boto3):
import boto3
s3 = boto3.client("s3", region_name="ap-south-1")
# Generate URL that expires in 1 hour
url = s3.generate_presigned_url(
"get_object",
Params={"Bucket": "your-backup-bucket", "Key": "backups/backup.tar.gz"},
ExpiresIn=3600
)
print(url) # anyone with this URL can download for 1 hour
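The idea behind a presigned URL is that the expiry time is part of what gets signed, so nobody can extend it without the secret. The sketch below illustrates that idea with a plain HMAC. It is not the AWS Signature Version 4 algorithm that S3 actually uses, and SECRET, sign_url and verify_url are hypothetical names for this example.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret-key"  # hypothetical; never hardcode real keys


def _signature(path, expires):
    return hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(path, expires_in, now=None):
    """Append an expiry timestamp and a signature covering path + expiry."""
    expires = int((time.time() if now is None else now) + expires_in)
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url, now=None):
    """Accept only URLs with a valid signature that have not yet expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    valid = hmac.compare_digest(signature, _signature(parsed.path, expires))
    return valid and current < expires


url = sign_url("/backups/backup.tar.gz", expires_in=3600, now=1000)
print(verify_url(url, now=2000))   # within the hour
print(verify_url(url, now=5000))   # after expiry
```

Editing the expires value in the URL invalidates the signature, which is why a leaked presigned URL stops working on schedule.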
Update backup.sh to upload to S3 after creating the local archive:
# At end of backup.sh:
if command -v aws &>/dev/null; then
aws s3 cp "$ARCHIVE" "s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")" && \
echo "Uploaded to S3: s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")"
fi
Wednesday-Thursday: VPC
The default VPC works for most things. Understanding custom VPCs is the interview signal.
Create a custom VPC:
- VPC → Create VPC
  - Name: app-vpc
  - CIDR: 10.0.0.0/16
- Create subnets:
  - Public subnet: 10.0.1.0/24, AZ: ap-south-1a
  - Private subnet: 10.0.2.0/24, AZ: ap-south-1b
- Create internet gateway, attach to app-vpc
- Route table for public subnet: add route 0.0.0.0/0 → internet gateway
- Route table for private subnet: no internet route (private)
- Launch EC2 in public subnet with auto-assign public IP enabled (custom subnets do not assign one by default); it then has a public IP and internet access
- Understand: an EC2 in private subnet has no internet unless you add a NAT gateway (costs ~$0.045/hour; don't provision, just understand)
The interview question version: "What's the difference between a public and private subnet in AWS?" Answer: a public subnet has a route to an internet gateway; a private subnet does not. Resources in a private subnet can reach the internet via a NAT gateway but cannot be reached from the internet directly.
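The CIDR layout above can be verified with Python's standard ipaddress module before you click through the console. A small sketch follows; the helper names are made up. The figure of five reserved addresses per subnet reflects AWS reserving the first four addresses and the last one in every subnet.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
public = ipaddress.ip_network("10.0.1.0/24")
private = ipaddress.ip_network("10.0.2.0/24")

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_hosts(subnet):
    """Addresses left for instances after AWS's reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


def validate_layout(vpc_net, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    for subnet in subnets:
        if not subnet.subnet_of(vpc_net):
            problems.append(f"{subnet} is outside {vpc_net}")
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


print(validate_layout(vpc, [public, private]))
print(usable_hosts(public), "usable addresses in the public subnet")
print(vpc.num_addresses // public.num_addresses, "possible /24 subnets in the VPC")
```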
Friday-Saturday: Integrate S3 into the CI/CD pipeline
Update .github/workflows/deploy.yml in vps-multiservice:
- name: Archive previous version to S3
run: |
aws s3 cp docker-compose.yml \
s3://your-backup-bucket/deployments/$(date +%Y%m%d_%H%M%S)/docker-compose.yml
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: ap-south-1
This demonstrates integration of S3 into a CI/CD pipeline. The backup is a real operational artifact: it means you can reconstruct what was deployed at any given time.
Week 9: CloudWatch
Learning objectives: CloudWatch Logs (log groups, log streams, log agents), metric filters, alarms, custom metrics via boto3
Monday-Tuesday: CloudWatch Logs
Install CloudWatch agent on EC2:
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
dpkg -i amazon-cloudwatch-agent.deb
Configure /opt/aws/amazon-cloudwatch-agent/bin/config.json:
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "vps-multiservice",
"log_stream_name": "nginx-access"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "vps-multiservice",
"log_stream_name": "nginx-error"
}
]
}
}
}
}
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
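The EC2 IAM role needs CloudWatch permissions. Add this statement to the role, keeping the actions narrow rather than logs:* (the AWS-managed CloudWatchAgentServerPolicy is an alternative):
{
  "Effect": "Allow",
  "Action": [
    "cloudwatch:PutMetricData",
    "logs:CreateLogGroup", "logs:CreateLogStream",
    "logs:PutLogEvents", "logs:DescribeLogStreams"
  ],
  "Resource": "*"
}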
Wednesday-Thursday: Metric filters + alarms
In the AWS console:
- CloudWatch → Log groups → vps-multiservice → nginx-access
- Create metric filter: pattern [ip, id, user, timestamp, request, status_code=5*, ...]
- Metric name: nginx-5xx, metric value: 1
- Create alarm on this metric: if sum > 5 in 5 minutes → SNS notification to your email
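Before trusting the metric filter, it helps to run the same check locally against a log file. The sketch below counts 5xx responses in nginx's default combined log format. The regular expression is an approximation of that format and the sample lines are invented for illustration.

```python
import re

# ip, identity, user, [timestamp], "request", status: the leading fields of the combined format
ACCESS_LINE = re.compile(
    r'^(?P<ip>\S+) (?P<id>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def count_5xx(lines):
    """Count lines whose status code starts with 5; unparseable lines are skipped."""
    total = 0
    for line in lines:
        match = ACCESS_LINE.match(line)
        if match and match.group("status").startswith("5"):
            total += 1
    return total


# Invented sample lines, not real traffic
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2025:13:55:36 +0000] "GET /health HTTP/1.1" 200 52 "-" "curl/8.5.0"',
    '203.0.113.7 - - [10/Oct/2025:13:55:40 +0000] "GET /api/items HTTP/1.1" 502 157 "-" "curl/8.5.0"',
    '198.51.100.4 - alice [10/Oct/2025:13:56:01 +0000] "POST /api/items HTTP/1.1" 503 0 "-" "python-requests/2.32"',
    "not an access log line",
]

print(count_5xx(SAMPLE))
```

On the server, the same function can read the real file: count_5xx(open("/var/log/nginx/access.log")).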
Billing alarm (if not done in Week 7):
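# Billing metrics are published only in us-east-1, and billing alerts must be enabled in Billing preferences first.
# Single quotes stop the shell from expanding $5 inside the description.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "billing-alert-5usd" \
  --alarm-description 'Alert when charges exceed $5' \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 86400 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=Currency,Value=USD \
  --evaluation-periods 1 \
  --alarm-actions <your-sns-topic-arn>   # the SNS topic must also be in us-east-1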
Friday-Saturday: Custom metrics + Python integration
Add a --push-to-cloudwatch flag to vps-health-reporter.py:
import boto3

def push_to_cloudwatch(metrics_dict, namespace="VPS/Health"):
    cw = boto3.client("cloudwatch", region_name="ap-south-1")
    metric_data = []
    for name, value in metrics_dict.items():
        metric_data.append({
            "MetricName": name,
            "Value": float(value),
            # metric names containing "pct" are reported as percentages
            "Unit": "Percent" if "pct" in name.lower() else "Count"
        })
    if metric_data:
        cw.put_metric_data(Namespace=namespace, MetricData=metric_data)
        print(f"Pushed {len(metric_data)} metrics to CloudWatch")
Run this as a cron job on the EC2 every 5 minutes:
crontab -e
# Add:
*/5 * * * * /usr/bin/python3 /opt/scripts/vps-health-reporter.py --push-to-cloudwatch
In the CloudWatch console, verify the custom metrics appear under the VPS/Health namespace.
Week 10: Phase 2 Consolidation
What this week unlocks: You have AWS concretely on your CV. This week polishes everything, prepares interview answers for each project, and pushes the application cadence to 10+/week.
Monday-Wednesday: Documentation and GitHub polish
vps-multiservice README must now include:
- Architecture diagram: internet → Route53 (or direct IP) → Nginx (SSL) → Docker network → Flask/Redis/Postgres
- CI/CD flow: push to GitHub → Actions → SSH deploy to EC2 → smoke test
- AWS resources used: EC2, IAM role, S3 (backups), CloudWatch (logs and metrics)
- How to reproduce: step-by-step from zero to running
ops-scripts README: add a section on S3 backup integration and CloudWatch metric publishing.
Thursday-Friday: Interview preparation: talking about your projects
For each project, prepare a 3-minute verbal answer to "tell me about this project":
- FactorSphere: "FactorSphere is a live academic journal ranking platform I built that serves real users. The architecture is fully edge-native: it runs on Cloudflare Workers rather than a traditional server, which means requests are processed at the edge closest to the user. The backend is a set of serverless microservices analogous to AWS Lambda, with one for search, one for ranking, and one for data processing. I integrated Pinecone as a managed vector database for semantic search and an LLM for query understanding. CI/CD is automated via GitHub Actions and Wrangler. I've documented the full pipeline in the repo."
- vps-multiservice: "This is a multi-service API stack I deployed on a VPS and on EC2. It runs Flask, Redis, and Postgres in Docker containers orchestrated by Docker Compose. I wrote multi-stage Dockerfiles to minimize image size, configured health checks so Compose knows when each service is ready before starting dependents, and set up Nginx as a reverse proxy with SSL termination. The deployment is automated via GitHub Actions: a push to main triggers an SSH deploy to the server with a smoke test. Logs ship to CloudWatch and backups go to S3 via an IAM role."
Saturday-Sunday: Application push
By Week 10, increase to 10-15 applications/week. You now have AWS on your CV. Apply to roles that were out of reach at Week 4:
- "AWS Cloud Engineer (Junior/Fresher)"
- "Cloud Infrastructure Engineer"
- GCC roles that specify AWS in the JD
Weeks 11-12: Buffer, Deepening, and Interview Prep
Week 11: Take the weakest area (likely whichever of Linux/Docker/AWS you feel least confident explaining verbally) and go deeper. Run through interview scenarios: set up a broken Docker network and diagnose it, break an Nginx config and read the error, terminate an EC2 instance and relaunch from scratch without notes.
Week 12: Final project polish, README quality pass on all repos, continue applying. If interviews are happening, focus prep here on the specific company's stack.
PHASE 3 (Weeks 13-24): Become Genuinely Competent
Weeks 13-15: Terraform
What this unlocks: Infrastructure as Code is expected at mid-level and increasingly mentioned in junior JDs. More importantly, after doing Weeks 7-9 by hand in the console and CLI, Terraform will click immediately: you already understand the resources, now you're just defining them declaratively.
Week 13: Terraform core concepts (providers, resources, state, plan/apply/destroy). Write Terraform that creates the Week 7/8 infrastructure: EC2, security groups, IAM role, S3 bucket. Run terraform plan to see what it would create. Run terraform apply. Verify it matches what you built manually.
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "your-terraform-state"
key = "vps-multiservice/terraform.tfstate"
region = "ap-south-1"
}
}
resource "aws_instance" "app" {
ami = "ami-0287a05f0ef0e9d9a" # Ubuntu 22.04, ap-south-1
instance_type = "t2.micro"
key_name = aws_key_pair.deploy.key_name
subnet_id = aws_subnet.public.id
vpc_security_group_ids = [aws_security_group.web.id]
iam_instance_profile = aws_iam_instance_profile.app.name
user_data = file("bootstrap.sh")
tags = {
Name = "vps-multiservice"
Project = "portfolio"
}
}
Week 14: Terraform state, modules, variables, outputs. Write a module for the EC2 + security group pair so it's reusable. Store state remotely in S3 (with DynamoDB locking: use an S3 backend; DynamoDB is free at this scale).
Week 15: Terraform for the full Week 7–9 stack: EC2, VPC, subnets, route tables, internet gateway, IAM roles, S3 buckets, CloudWatch log group. The entire infrastructure is now in the infrastructure-iac repo. Running terraform apply from scratch should produce a working environment.
Repository: infrastructure-iac. Contains Terraform for all AWS infrastructure, a README explaining what it creates and why, .terraform.lock.hcl committed, state backend configured, and variables documented in variables.tf.
Weeks 16–17: Prometheus + Grafana
What this unlocks: Monitoring stack demonstrates operational maturity beyond deployment. "Have you set up monitoring?" is a common interview question; being able to say yes with a GitHub repo is a strong differentiator at junior level.
Week 16: Deploy Prometheus and the Node Exporter on the Hetzner VPS (not EC2; keep costs at zero here). Prometheus scrapes Node Exporter for system metrics (CPU, memory, disk, network). Scrape the Flask API's /metrics endpoint (add prometheus-flask-exporter to the app). Prometheus config:
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
  - job_name: flask-api
    static_configs:
      - targets: ['localhost:5000']
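For context on what Prometheus actually pulls from each scrape target, here is a minimal sketch of the plain-text exposition format a /metrics endpoint returns. It uses only the standard library; the metric name demo_http_requests_total and the helper render_counter are hypothetical and are not what prometheus-flask-exporter emits.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to a numeric sample value.
    The metric name passed in by callers is illustrative only.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical request counts, keyed by label set.
    demo = {
        (("method", "GET"), ("status", "200")): 42,
        (("method", "GET"), ("status", "502")): 3,
    }
    print(render_counter("demo_http_requests_total", "Total HTTP requests.", demo), end="")
```

In practice the exporter library generates this output for you; the sketch is only meant to make the scrape payload less opaque when you curl the endpoint by hand.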
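Run everything in Docker Compose. Inside the Prometheus container, localhost refers to the container itself, so the scrape targets in prometheus.yml must point at the API's Compose service name and at a host address where Node Exporter is reachable. Add Prometheus and Node Exporter to the existing docker-compose.yml in vps-multiservice: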
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
  node_exporter:
    image: prom/node-exporter:latest
    network_mode: host
    pid: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
Week 17: Deploy Grafana, connect it to Prometheus as a data source, build dashboards:
- System overview: CPU, memory, disk, network I/O
- Flask API: request rate, error rate, latency (p50, p95, p99)
- Alerts in Alertmanager: email when disk > 80% or API error rate > 5%
Export your dashboards as JSON and commit them to the repo. This means the monitoring stack is reproducible.
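To make the p50/p95/p99 panels less abstract, here is a small sketch of what a latency percentile means, computed with the nearest-rank method on raw samples. This is an illustration under that assumption only: Prometheus itself estimates quantiles from histogram buckets, so dashboard values will not match an exact calculation like this one. The function name percentile is hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    latencies_ms = [12, 15, 11, 250, 14, 13, 18, 16, 900, 17]
    for pct in (50, 95, 99):
        print(f"p{pct} = {percentile(latencies_ms, pct)} ms")
```

The point to carry into an interview: the median hides the slow tail, which is why the dashboard tracks p95 and p99 alongside p50.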
Weeks 18–22: Kubernetes
Five weeks because the surface area is large and the concepts require time to internalize.
Week 18: Kubernetes architecture (what a Node is, what a Pod is, what a Deployment is, how the control plane works: API server, scheduler, controller manager, etcd). Install k3s on a local KVM VM (not on the VPS; k3s with multiple services will use too much RAM on CX22):
curl -sfL https://get.k3s.io | sh -
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes
kubectl get pods -A
Write Kubernetes manifests for the Flask API:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
        - name: flask-api
          image: myapi:latest
          ports:
            - containerPort: 5000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 30
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api
spec:
  selector:
    app: flask-api
  ports:
    - port: 80
      targetPort: 5000
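The resource figures in the manifest are easier to reason about once converted to plain numbers. The sketch below parses the two quantity forms used above ("100m" CPU, "64Mi" memory) and totals them across replicas. It deliberately handles only a small subset of Kubernetes quantity suffixes, and the function names are hypothetical.

```python
# Binary suffixes only; decimal suffixes (K, M, G) and others are out of scope for this sketch.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu_millicores(quantity):
    """'100m' -> 100 millicores; '2' -> 2000; '0.5' -> 500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_bytes(quantity):
    """'64Mi' -> 67108864 bytes; a bare number is treated as bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def total_for_replicas(replicas, cpu, memory):
    """Total (millicores, bytes) the scheduler must reserve for all replicas."""
    return replicas * parse_cpu_millicores(cpu), replicas * parse_memory_bytes(memory)


if __name__ == "__main__":
    cpu_m, mem_b = total_for_replicas(2, "100m", "64Mi")
    print(f"requests: {cpu_m}m CPU, {mem_b // 2**20}Mi memory")
    cpu_m, mem_b = total_for_replicas(2, "200m", "128Mi")
    print(f"limits:   {cpu_m}m CPU, {mem_b // 2**20}Mi memory")
```

With two replicas, the Deployment above reserves 200m CPU and 128Mi of memory in requests, and can use up to 400m CPU and 256Mi before limits apply.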
Weeks 19–20: ConfigMaps, Secrets, Ingress (using k3s's built-in Traefik ingress controller), PersistentVolumes for Postgres. Deploy the full multi-service stack to Kubernetes.
Weeks 21–22: Rolling updates, rollbacks (kubectl rollout undo deployment/flask-api), Horizontal Pod Autoscaler, namespace isolation, RBAC basics. Write a GitHub Actions workflow that builds a Docker image, pushes to Docker Hub or GitHub Container Registry, and updates the k3s deployment via kubectl set image.
Repository: k8s-fundamentals. All manifests in manifests/, a Kustomize or Helm intro in Week 22, and a README explaining architecture decisions.
Weeks 23–24: Consolidation
Audit all repos: every README answers what it runs, how to run it, what decisions were made, and what you'd do differently with more time. This last part, "what I'd do differently", is an advanced interview signal. Examples: "I'd add a proper secrets management solution (Vault or AWS Secrets Manager) instead of .env files", "I'd move Terraform state to a proper remote backend with team locking", "I'd add structured logging so Prometheus can ingest log metrics directly."
Write up a "portfolio narrative", a one-page document (not on GitHub, for your own use) that ties everything together: FactorSphere shows edge-native architecture and LLM integration; vps-multiservice shows traditional server ops; the Terraform repo shows IaC; the monitoring stack shows observability; the Kubernetes repo shows container orchestration. This is your answer to "walk me through your projects" in a senior technical round.
END SECTION
1. GitHub Repositories by Milestone
Week 4 Repositories
ops-scripts: Bash health check, backup, and log analyzer scripts; Python endpoint monitor, deploy check, Cloudflare deploy monitor, config validator, and VPS health reporter. Why a recruiter cares: it demonstrates ability to automate real operational tasks, the most common take-home test format. Scripts that run on real infrastructure (not mock data) are distinguishable immediately.
vps-multiservice: Docker Compose stack (Flask + Redis + Postgres), multi-stage Dockerfile, .env.example, Makefile with operational targets, systemd service file, and README documenting architecture decisions. Why a recruiter cares: it's a real multi-service production-style stack running on actual infrastructure, not a tutorial clone. The health check endpoint that checks dependency connectivity is a concrete signal of operational thinking.
factorsphere (existing, updated): docs/ARCHITECTURE.md and docs/CICD.md added. Why a recruiter cares: a live product with real users transforms this from a class project into a production deployment story. The architecture documentation demonstrates that you understand what you built, not just that you ran commands.
Week 8 Repositories
vps-multiservice (updated): now includes GitHub Actions CI/CD workflow deploying to both VPS and EC2, Nginx reverse proxy configuration, CloudWatch log shipping, S3 backup integration, and updated README with the full AWS architecture. A badge shows the live CI/CD status. Why a recruiter cares: this is a complete junior DevOps portfolio project covering Docker, CI/CD, AWS, monitoring, and backups, the canonical skill set for a junior role.
ops-scripts (updated): S3 backup integration in backup.sh, CloudWatch metric publishing in vps-health-reporter.py, updated README. Why a recruiter cares: scripts that interact with real cloud services (boto3, AWS CLI) demonstrate practical cloud knowledge beyond "I know what S3 is."
Week 24 Repositories
infrastructure-iac: Terraform modules for the full AWS stack (EC2, VPC, subnets, security groups, IAM, S3). terraform apply from a clean account produces the entire vps-multiservice environment. Why a recruiter cares: IaC in any junior portfolio is unusual and signals maturity; Terraform specifically is the most common IaC tool in NCR DevOps JDs.
monitoring-stack: Prometheus, Grafana, and Alertmanager in Docker Compose, dashboard JSON files committed, Prometheus recording rules, Alertmanager config with email notifications. Why a recruiter cares: monitoring is the answer to "how do you know your service is healthy?", and most junior candidates can't demonstrate this.
k8s-fundamentals: Kubernetes manifests for the multi-service stack running on k3s, GitHub Actions pipeline that builds and deploys to the cluster, RBAC configuration, HPA configuration. Why a recruiter cares: Kubernetes is increasingly mentioned even in junior NCR DevOps JDs; a working cluster with manifests is proof of hands-on experience, not certification prep.
2. Job Titles and Application Timing
Apply now: Week 4
Titles: Junior DevOps Engineer, DevOps Engineer (0-2 years), Cloud Support Engineer, Infrastructure Engineer, Systems Engineer, Associate DevOps Engineer
Platforms: LinkedIn (primary; set alerts), Naukri.com (mandatory; service companies and GCCs post here exclusively), Instahyre, Wellfound
Company targets in NCR: Nagarro (Gurgaon), GlobalLogic (Noida), Publicis Sapient (Gurgaon), NIIT Technologies, Mphasis, smaller funded startups from the Antler portfolio and other VC-backed companies. For service companies: HCL Technologies and Wipro have DevOps JDs that are genuine infrastructure work (not all of them; filter by the actual JD content, not just the title).
Direct career pages worth bookmarking: nagarro.com/careers, globallogic.com/careers, publicissapient.com/careers, mphasis.com/careers. These pages often have roles that don't appear on aggregators for 1–2 weeks after posting.
Apply at Week 8
You now have AWS concretely on your CV. Expand to: AWS Cloud Engineer (Junior), Cloud Infrastructure Engineer, Cloud Operations Engineer, DevOps Engineer roles that specify AWS in the JD.
New targets are GCCs that explicitly require AWS: Optum (Noida), Genpact Technology (Gurgaon), EXL Service, WNS, Concentrix Technology. Apply via their career pages and Naukri simultaneously. Your profile is now meaningfully stronger: end-to-end AWS stack (EC2, IAM, S3, VPC, CloudWatch) with CI/CD and a live project to discuss.
Do not apply until Week 24
Platform Engineer: typically requires Kubernetes + Terraform + 2+ years. Platform teams at larger companies in NCR have a higher bar.
Senior DevOps Engineer: even at companies that post "1-3 years experience required", the actual expectation is 2+ years of real DevOps work. Applying earlier wastes your time.
DevSecOps Engineer: security tooling (Vault, Snyk, SAST/DAST pipelines) is a domain requiring deliberate study not on this roadmap.
Cloud Architect: requires 5+ years and cert-level AWS knowledge.
3. What Gets Tested in NCR DevOps Interviews
Phone screen / HR call
You will be asked: your experience summary (have a 90-second version), which tools you've used (answer specifically: not "I know Docker", but "I've written multi-stage Dockerfiles and run Compose stacks on a VPS"), availability and notice period (fresh grad = immediate joining), salary expectation (state your target, not your floor; see Section 5), why DevOps specifically.
They are screening for: articulate communication, genuine experience (not just listed on CV), salary fit, and basic logical thinking. They are not testing technical knowledge here.
Technical round
Linux questions that actually appear:
- "What does the first column of ps aux output mean?" (process owner)
- "How do you find which process is using port 8080?" (ss -tlnp | grep 8080 or lsof -i :8080)
- "What's the difference between kill -9 and kill -15?" (SIGKILL vs SIGTERM: immediate termination vs graceful shutdown)
- "A service is failing to start. Walk me through your diagnosis." (systemctl status, journalctl -u servicename -n 50, check that the binary exists and permissions are correct, check the port isn't already in use)
- "What is load average?" (average number of processes running or waiting to run over the last 1/5/15 minutes, read relative to the number of CPU cores; on Linux it also counts processes in uninterruptible I/O wait)
Docker questions that actually appear:
- "What's the difference between CMD and ENTRYPOINT in a Dockerfile?" (CMD is overridable at run time; ENTRYPOINT is fixed unless you pass --entrypoint; when both are set, CMD provides default arguments to ENTRYPOINT)
- "How do containers in a Docker Compose network resolve each other?" (by service name; Docker provides built-in DNS for user-defined networks)
- "What's a multi-stage build and why use it?" (multiple FROM instructions, only the final stage goes into the image, which keeps build tools out of the production image and reduces size and attack surface)
- "How do you make a Docker container restart automatically?" (--restart unless-stopped, or restart: unless-stopped in Compose)
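The service-name resolution answer is easier to defend if you have actually checked it from inside a container. The sketch below separates the two failure modes, name resolution and TCP connect, using only the standard library. The function name check_tcp and the example host "redis" are assumptions for illustration, not part of any Docker tooling.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return (ok, detail) for a DNS lookup followed by a TCP connect to host:port."""
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name does not resolve: wrong service name, or the containers share no network.
        return False, f"DNS resolution failed: {exc}"
    ip = addresses[0][4][0]
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"resolved to {ip}, port {port} is accepting connections"
    except OSError as exc:
        # Name resolves but nothing answers: service down, wrong port, or bound to 127.0.0.1.
        return False, f"resolved to {ip}, but connect failed: {exc}"


if __name__ == "__main__":
    # "redis" is a hypothetical Compose service name; run this inside another container.
    ok, detail = check_tcp("redis", 6379)
    print("OK" if ok else "FAIL", "-", detail)
```

Splitting the check this way mirrors the verbal answer: first confirm the name resolves, then confirm the port is listening.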
Networking questions that actually appear:
- "Walk me through what happens when a user visits a URL" (DNS lookup → TCP connection → TLS handshake → HTTP request → response)
- "What's the difference between a reverse proxy and a load balancer?" (reverse proxy hides the backend; load balancer distributes across multiple backends; Nginx can do both)
- "What does a 502 error mean?" (bad gateway: the proxy received an invalid response from the upstream server)
- "How does SSH authentication work?" (client presents public key, server challenges with random data, client signs it with private key, server verifies the signature with the stored public key)
Scripting:
- You may be asked to write a Bash or Python script live. Common formats: "write a script that checks if these services are running and restarts any that aren't", "write a function that parses this log file and counts occurrences of each status code"
- The evaluator is checking: do you use proper error handling (set -euo pipefail in Bash, try/except in Python), do you use functions, do you write readable code
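As a reference point for the second prompt above, here is one way the status-code counter might look. It is a sketch that assumes the common/combined access log layout, where the three-digit status follows the quoted request line; the regex and function names are illustrative, and a real interview log may need a different pattern.

```python
import re
from collections import Counter

# Assumed layout: ... "GET /path HTTP/1.1" 200 1234 ...
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_codes(lines):
    """Count HTTP status codes in an iterable of access-log lines; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_status_codes_in_file(path):
    """Same as count_status_codes, with explicit handling for an unreadable file."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return count_status_codes(handle)
    except OSError as exc:
        raise SystemExit(f"cannot read {path}: {exc}")


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /health HTTP/1.1" 200 15',
        '203.0.113.5 - - [10/Jan/2025:10:00:01 +0000] "GET /api HTTP/1.1" 502 0',
    ]
    for code, count in count_status_codes(sample).most_common():
        print(code, count)
```

Note how it covers the three things the evaluator looks for: a function per task, explicit error handling for the file read, and a tolerant path for malformed lines.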
Practical / take-home task
Common formats (2–4 hours):
- "Write a Dockerfile for this Python app, a Docker Compose file that adds Redis, and a health check endpoint" (you'll have a GitHub repo to fork)
- "Write a GitHub Actions workflow that runs tests and deploys to a remote server on push to main" (they provide fake SSH credentials or ask you to describe what the secrets would contain)
- "Write a Python script that monitors these endpoints and emails you if any return non-200" (tests requests, SMTP, argparse)
- "Debug this broken docker-compose.yml" (they give you a file with 3-5 intentional errors)
What separates good submissions: health checks are included, .env.example is present, README explains what it does and how to run it, commit history shows incremental work (not one giant commit), you handle error cases explicitly.
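For the endpoint-monitor format listed above, the core check can be sketched with the standard library alone. This version uses urllib rather than requests and leaves out the SMTP and argparse parts; the function names and the example URL are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout


def find_failures(urls, fetch=fetch_status):
    """Map each URL that did not return 200 to the status observed (None = unreachable)."""
    failures = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the list from the task description.
    for url, status in find_failures(["https://example.com/health"]).items():
        print(f"ALERT {url}: {status if status is not None else 'unreachable'}")
```

Passing the fetch function as a parameter is a deliberate choice: it lets you unit test the alerting logic without network access, which is an easy way to show explicit error-case handling in a submission.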
4. Common Mistakes That Prevent Freshers From Getting DevOps Jobs in India
Listing technologies you can't defend. If "Kubernetes" is on your CV because you ran kubectl get pods once in a tutorial, an interviewer asking "how does a rolling update work?" will end the conversation. Every technology on your CV must have a project behind it and a clear verbal answer to "tell me how you used this."
GitHub repos without READMEs. A recruiter spending 30 seconds on a repo with no README closes the tab. A repo called docker-practice with three files and no explanation tells a hiring manager nothing about your ability to operate infrastructure. Every repo needs to explain what it runs, why it exists, and what decisions were made.
Applying to renamed helpdesk roles. Many "DevOps Engineer" JDs at service companies are L1 support with a rebranded title. Read the JD carefully. Red flags include: "ITIL certification preferred", "ticketing system experience required", "incident management", no mention of Docker/AWS/Linux in the technical requirements. These roles do not build the skills you need and often pay below the floor you've set.
Not knowing your own project in depth. "I deployed a multi-service Docker Compose stack" is not an answer. "I deployed a Flask API with Redis and Postgres in Docker Compose, using multi-stage builds to reduce image size from 1.1GB to 180MB, with health checks so Postgres is confirmed ready before the API starts, running behind Nginx with SSL termination, and automated via GitHub Actions" is an answer. Practice this verbally, not just in your head.
Treating salary floor as the opening number. If you say "I'm looking for around ₹35,000" at a startup that would pay ₹55,000 for your profile, you've lost ₹20,000/month permanently. Know the market rates (see Section 5), and start at your target ceiling for each company tier, not your floor.
One-commit GitHub histories. A repo where all the work appears in a single commit titled "add project files" signals that you copied files over rather than built incrementally. Commit as you work. The commit history is evidence of your process.
Not applying early enough. The job search takes time independent of your preparation level. Candidates who start applying at Week 4 get their first offers at Week 8–12. Candidates who wait until they feel "ready" start at Week 8 and get their first offers at Week 14–18. The feedback from real rejections improves your interview performance faster than another week of studying.
Inconsistent positioning. Applying to a DevOps role and then mentioning in the interview that you also do React or full-stack work signals that you don't actually want a DevOps role. Decide on the role type and be consistent in every touch point β resume, LinkedIn, outreach, interview answers.
Vague answers to "tell me about your experience." "I worked at an Antler-backed company on various projects" is vague. "I was a software developer intern at a venture studio backed by Antler, where I shipped production SaaS MVPs across multiple projects; I owned Docker configuration and CI/CD pipelines across three projects, integrated with external APIs, and worked with distributed teams across time zones" is not vague.
Sending generic outreach. A LinkedIn message that could have been sent to anyone ("I am a passionate DevOps professional seeking opportunities") gets ignored. One that references the company's actual stack or a specific JD they posted gets responses. Take 3 minutes per message to make it specific.
5. Realistic Salary Ranges
NCR service companies (TCS, HCL, Wipro, Infosys, Tech Mahindra)
Range: ₹3.5–5 LPA (CTC). Take-home is ~70–75% of CTC after PF, tax, insurance deductions. At ₹4.5 LPA CTC, take-home is approximately ₹28,000–31,000/month.
Your profile puts you toward the upper end of the fresher band, but service companies have fixed fresher slabs that don't move much for individual profile quality. The internship experience may place you in the "experienced fresher" band at some companies (₹4.5–5.5 LPA) versus the "campus fresher" band (₹3.5–4 LPA).
Honest assessment: these roles are the right floor for negotiation, not the target. The actual work at L1/L2 entry in service companies is often not genuine infrastructure. The learning environment is slower. Treat these as the fallback, not the goal.
Skills that move you up within this band: AWS certification (not worth getting for this tier, though), ITIL knowledge (not worth learning for this tier either). Don't optimize for this band.
NCR funded startups (Series AβC, Antler portfolio)
Range: ₹5–8 LPA for a genuine junior DevOps role. At ₹6 LPA, take-home is approximately ₹42,000–45,000/month depending on structure.
Your Antler studio connection is a direct advantage here. Antler portfolio companies know what their studio interns produce, so the signal is stronger than a cold application. FactorSphere as a live production product with real users is unusual for a fresher; most candidates at this stage have tutorial clones.
Your realistic target at a well-funded startup: ₹6–7 LPA at Week 4, negotiable to ₹7–8 LPA after Week 8 with AWS on the CV.
Skills that move you up this band: AWS working knowledge (Week 8), ability to own deployment pipelines end-to-end without supervision, Docker and Compose at production level (Week 4), Python scripting that actually runs in their infrastructure.
Product companies and GCCs (Nagarro, GlobalLogic, Publicis Sapient, Optum, Genpact Tech, EXL)
Range: ₹6–10 LPA for cloud/DevOps roles. At ₹8 LPA, take-home is approximately ₹55,000–60,000/month.
GCCs pay more than domestic companies for equivalent work because they're paying against a global compensation benchmark. The tradeoff is a higher technical bar at screening and more structured interview processes.
Your profile is competitive here after Week 8 (AWS added). Before Week 8, your serverless and Cloudflare experience is harder to map to what GCC interviewers are looking for; after Week 8, the AWS + Docker + CI/CD story is a clean match.
Realistic target at a GCC: ₹7–9 LPA at Week 8, with AWS and a demonstrated CI/CD project.
Skills that move you up this band: specific AWS service depth (beyond EC2/S3: RDS, ECS, ECR, CodePipeline), monitoring/observability (Prometheus or CloudWatch), IaC awareness (Terraform in Phase 3).
Your Week 4 realistic ceiling (before AWS): ₹6–7 LPA at a funded startup or mid-size product company where your FactorSphere + Docker + CI/CD story lands well.
Your Week 8 realistic ceiling (after AWS): ₹8–9 LPA at a GCC or strong product company.
Your floor (non-negotiable): ₹35,000/month = ~₹4.2 LPA. Achievable at service companies. Don't accept less; it's below market even at service companies for a profile with live production experience.
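The monthly-to-LPA conversion used for the floor is plain annualization, sketched below so you can rerun it for any figure a recruiter quotes. It assumes nothing about deductions: it multiplies a monthly amount by 12 and expresses it in lakhs. The function name is hypothetical.

```python
RUPEES_PER_LAKH = 100_000


def monthly_to_lpa(monthly_rupees):
    """Annualize a monthly rupee amount and express it in lakhs per annum."""
    return monthly_rupees * 12 / RUPEES_PER_LAKH


if __name__ == "__main__":
    print(f"Floor: {monthly_to_lpa(35_000):.1f} LPA")                       # 35,000 x 12 = 4.2 lakh
    print(f"Cost of a low anchor: {monthly_to_lpa(20_000):.1f} LPA/year")   # 55,000 vs 35,000 per month
```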
Negotiation: When a recruiter asks for your expectation, say the target number for that company tier, not your floor. "I'm looking for ₹6–7 LPA, based on my production experience and the market for junior DevOps roles in NCR." If they push back, ask what the band is before accepting or declining.
6. Honest Assessment of the Week 4 Target
Week 4 interview-readiness for junior DevOps roles at startups and mid-size product companies in NCR is realistic. This is not inflated encouragement; it's based on what your profile actually produces by Week 4:
You arrive at Week 4 with: Python competency, Linux daily driver experience, Docker from internship, Git fluency, a live production product with real users, and production CI/CD experience. These are not nothing. Most fresher DevOps candidates have none of the internship context and only classroom exposure.
By Week 4 you add: ops-level Linux (systemd, SSH hardening, process management, log analysis), practical networking (DNS, HTTP, SSH tunneling, firewalls), DevOps Python scripting (subprocess, requests, YAML/JSON, proper CLIs), and production Docker + Compose (multi-stage builds, health checks, named volumes, systemd-managed stacks).
This is enough to pass a phone screen and a standard junior technical round at a startup or mid-tier product company. It is not enough to pass a rigorous GCC technical screen that probes AWS depth; that happens after Week 8.
Most likely blocker: The gap between knowing how something works and being able to explain it under mild interview pressure. Docker networking questions specifically ("a container in service A can't reach service B, what do you check?") require not just knowing the answer but being comfortable walking through it out loud. This gap closes with deliberate verbal practice, not with more studying. Spend 30 minutes every day of Week 4 talking through your projects as if you're in an interview. Record yourself if you can; the first playback will tell you exactly what to fix.
Second most likely blocker: Sparse GitHub repos that don't match your verbal claims. If you say "I built a multi-service Docker Compose stack on a VPS" and the recruiter opens the repo to find three files and no README, you've undermined your own story. Prioritize clean, documented repos with meaningful commit history over adding more features.
Week 6–8 for first offer: Achievable if you apply at 10+ per week starting Week 4 and follow up on applications actively. Realistic for the right startup or product company.
Week 8–10 for first offer: The more likely outcome for most candidates, accounting for interview scheduling delays (common in India), HR processes, and the normal distribution of fit between your profile and open roles. This is not a failure case; it's the median outcome for a candidate with your profile executing this plan.
Week 6–8 for GCC offer: Unlikely. GCC processes in NCR typically take 4–6 weeks from application to offer even when you pass every round. Apply to GCCs at Week 8 and expect the offer to come at Week 12–14.
The 4-week interview-readiness target is sound. The 6–8 week first offer target is optimistic but achievable. The 8–10 week target is realistic without being pessimistic. The answer to "should I start applying at Week 4 even if I don't feel ready?" is yes, unambiguously.
Preserved source markdown
# DevOps Career Roadmap: NCR, 2025–26
---
## DSA: Answered Directly
**Do NCR DevOps, Cloud Support, and Infrastructure roles screen for LeetCode-style DSA?**
At funded startups targeting DevOps specifically: no. The technical screen is scripting (Python or Bash), Linux troubleshooting scenarios, and Docker/Compose tasks. No arrays, graphs, or dynamic programming.
At product companies and GCCs: mostly no, but with a real exception. Some GCCs (Optum, Genpact Tech, Publicis Sapient) route all applicants through a standardized first-round online assessment that includes basic coding: not algorithmic complexity, but "write a function that does X" type problems a second-year CS student should handle. If you apply to these through their career portal (not via referral or recruiter), you may hit this screen. The coding is at the level of "reverse a string without slicing" or "count words in a paragraph", not LeetCode mediums. If you've written Python for two years, you're fine. No prep needed.
At service companies (TCS, Wipro, HCL, Infosys): their entry tests include aptitude + basic coding. The coding section is trivial CS101 material, not competitive programming. If this is the blocker, 2 hours of practice on basic Python problems clears it.
**Conclusion:** No LeetCode prep is needed. No LeetCode appears in this roadmap under any framing. The only coding you'll write as interview prep is ops-relevant Python scripting, which is on the roadmap as a skill anyway.
---
## Technology Rationale Table
| Technology | Immediate Employability Impact | Learnable on the Job? | Why It's Here |
|---|---|---|---|
| Linux (ops-specific) | High | Partially | Every VPS, EC2, and container runs Linux; ops-level commands are tested directly in technical rounds |
| Git (infra conventions) | Medium | Yes | Ops repos, GitOps patterns, and READMEs are how hiring managers evaluate your actual work product |
| Networking | High | Partially | DNS, SSH, firewall, and HTTP debugging appear explicitly in junior DevOps technical screens in NCR |
| Python scripting | High | Partially | Automation scripts are the most common take-home task format; real scripts separate you from CV-padders |
| Docker | High | No | In nearly every junior DevOps JD in NCR; multi-stage builds and Compose are now baseline expectations |
| Docker Compose | High | Partially | Multi-service Compose stacks are the standard deployment unit at companies without Kubernetes |
| Nginx | High | Partially | Reverse proxy knowledge is expected at junior level; SSL termination and upstreams come up in interviews |
| GitHub Actions | High | Yes | CI/CD is table stakes; VPS deploy pipelines are more transferable than Cloudflare-specific ones |
| AWS EC2/IAM/S3/VPC | High | Partially | Most NCR product companies and GCCs run on AWS; differentiates you from candidates who only know on-prem |
| CloudWatch | Medium | Yes | Logging and basic monitoring show operational awareness; AWS-native so low friction to learn |
| Terraform | Medium | Yes | Increasingly in mid-level JDs; rarely a hard gate at junior level in NCR, but Phase 3 depth pays off post-hire |
| Prometheus | Medium | Yes | Monitoring stack is a bonus at junior level; signals genuine ops thinking beyond just deployment |
| Grafana | Medium | Yes | Dashboard work demonstrates operational maturity; trivial to add once Prometheus is running |
| Kubernetes | Medium | No | Appearing in NCR JDs even at junior level now; large surface area justifies Phase 3 placement |
---
## PHASE 1, Weeks 1–4: Become Employable
---
### Week 1: Linux (Ops-Specific) + Git (Infra Conventions)
**What this week unlocks:** Your VPS becomes a real work environment, not a toy. You can answer Linux troubleshooting questions in interviews. FactorSphere gets professional documentation that turns a live project into an interview story.
**Study hours: 42**
**Learning objectives:**
- Ops-level networking stack: reading active connections, capturing traffic, diagnosing from the command line
- SSH hardening: key-only auth, non-root user, config file discipline
- systemd: writing, enabling, and managing service units from scratch
- Bash scripting for automation: health checks, backups, log analysis
- Process and resource management at ops level
- Log analysis with journalctl and standard log tooling
- How infrastructure repos differ from application repos; writing READMEs a hiring manager actually reads
- Professional documentation for FactorSphere CI/CD pipeline and architecture
**Technologies:** Linux (Ubuntu 22.04 on Hetzner VPS), Bash, systemd, UFW, SSH, journalctl, Git
---
**Monday–Tuesday: VPS baseline and network stack**
SSH into the Hetzner VPS. If you're logging in as root, fix that first.
```bash
adduser deploy
usermod -aG sudo deploy
```
SSH hardening (edit `/etc/ssh/sshd_config`):
```
PasswordAuthentication no
PermitRootLogin no
AllowUsers deploy
```
Copy your public key to the new user:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519.pub deploy@<vps-ip>
```
Generate a dedicated key if you don't have one: `ssh-keygen -t ed25519 -C "vps-deploy-key"`
Restart sshd: `systemctl restart sshd`. Verify you can log in as `deploy` before ending the root session.
UFW setup:
```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow from <your-home-ip> to any port 22
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
ufw status verbose
```
Networking stack commands (run these and understand every line of output):
```bash
ip addr show
ip route show
ss -tlnp # listening TCP sockets with process names
ss -tulnp # TCP + UDP
netstat -tlnp # older but still appears in interviews
```
Install tcpdump: `apt install tcpdump`. Run:
```bash
tcpdump -i eth0 port 22 # watch your own SSH session
tcpdump -i eth0 port 80 -A # see HTTP traffic in ASCII
```
You're not becoming a packet analysis expert. You're learning to answer "how would you debug a connectivity issue" with real commands.
---
**Wednesday: systemd**
Write a real systemd service. Create a minimal Python HTTP server first:
```python
# /opt/healthapi/server.py
import http.server, json
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())
    def log_message(self, *args): pass
http.server.HTTPServer(('', 8080), Handler).serve_forever()
```
Service unit `/etc/systemd/system/healthapi.service`:
```ini
[Unit]
Description=Health API
After=network.target
[Service]
Type=simple
User=deploy
WorkingDirectory=/opt/healthapi
ExecStart=/usr/bin/python3 /opt/healthapi/server.py
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
```
```bash
systemctl daemon-reload
systemctl enable healthapi
systemctl start healthapi
systemctl status healthapi
curl http://localhost:8080
journalctl -u healthapi -f # tail logs
journalctl -u healthapi --since "1 hour ago"
journalctl -p err -u healthapi # errors only
```
Stop the service, break it intentionally (wrong path), observe `systemctl status` output. This is what debugging looks like.
---
**Thursday: Bash scripting for automation**
Write three scripts. These go on GitHub. They are real deliverables, not exercises.
`health-check.sh` (VPS health report):
```bash
#!/bin/bash
set -euo pipefail
THRESHOLD_DISK=80
THRESHOLD_MEM=90
echo "=== VPS Health Check $(date) ==="
# Disk
DISK_USE=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
echo "Disk usage: ${DISK_USE}%"
[ "$DISK_USE" -gt "$THRESHOLD_DISK" ] && echo "WARNING: disk above ${THRESHOLD_DISK}%" >&2
# Memory
MEM_TOTAL=$(free -m | awk '/Mem:/ {print $2}')
MEM_USED=$(free -m | awk '/Mem:/ {print $3}')
MEM_PCT=$(( MEM_USED * 100 / MEM_TOTAL ))
echo "Memory usage: ${MEM_PCT}% (${MEM_USED}/${MEM_TOTAL} MB)"
[ "$MEM_PCT" -gt "$THRESHOLD_MEM" ] && echo "WARNING: memory above ${THRESHOLD_MEM}%" >&2
# Services β edit list for your environment
for SVC in healthapi nginx docker; do
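    # is-active prints the state itself and exits non-zero when the unit is not active;
    # "|| true" keeps set -e from aborting without appending a second word to STATUS
    STATUS=$(systemctl is-active "$SVC" 2>/dev/null || true)
    echo "Service $SVC: ${STATUS:-unknown}"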
done
# Open ports
echo "Listening ports:"
ss -tlnp | awk 'NR>1 {print $1, $4, $6}'
```
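Since the Python side of ops-scripts includes a VPS health reporter, the disk check from the Bash script can be sketched in Python as well. This is an illustrative equivalent, not the reporter itself: it uses `shutil.disk_usage`, the function names are hypothetical, and the percentage may differ slightly from `df`, which accounts for reserved blocks differently.

```python
import shutil


def disk_usage_percent(path="/"):
    """Used space on the filesystem containing `path`, as a whole-number percentage."""
    usage = shutil.disk_usage(path)
    return usage.used * 100 // usage.total


def check_disk(path="/", threshold=80):
    """Return (percent_used, over_threshold) for the filesystem containing `path`."""
    percent = disk_usage_percent(path)
    return percent, percent > threshold


if __name__ == "__main__":
    percent, over = check_disk("/", threshold=80)
    print(f"Disk usage: {percent}%")
    if over:
        print("WARNING: disk above 80%")
```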
`backup.sh` (timestamped archive):
```bash
#!/bin/bash
set -euo pipefail
SRC="${1:?Usage: backup.sh <source-dir>}"
DEST="/var/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
ARCHIVE="${DEST}/backup_${TIMESTAMP}.tar.gz"
mkdir -p "$DEST"
tar czf "$ARCHIVE" "$SRC"
echo "Backup created: $ARCHIVE ($(du -sh "$ARCHIVE" | cut -f1))"
# Keep last 7 backups
ls -t "${DEST}"/backup_*.tar.gz | tail -n +8 | xargs -r rm
echo "Old backups pruned. Current count: $(ls "${DEST}"/backup_*.tar.gz | wc -l)"
```
`log-analyzer.sh` (parse journalctl for errors in the last N hours):
```bash
#!/bin/bash
HOURS="${1:-24}"
echo "=== Error summary: last ${HOURS} hours ==="
journalctl --since "${HOURS} hours ago" -p err --no-pager | \
awk '{print $5}' | sort | uniq -c | sort -rn | head -20
```
Make them executable (`chmod +x`), test them, commit them with a meaningful message: `feat(scripts): add VPS health check, backup, and log analysis scripts`.
---
**Friday-Saturday: Process management, disk, LVM awareness, log deep dive**
Process management:
```bash
ps aux --sort=-%cpu | head -10 # top CPU consumers
ps aux --sort=-%mem | head -10 # top memory consumers
kill -9 <pid> # SIGKILL
kill -15 <pid> # SIGTERM (graceful)
nice -n 10 <command> # lower priority
renice -n 5 -p <pid> # change running process priority
```
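`top` and `htop`: know what load average means. A load average of 1.0 on a single-core machine means the CPU is fully utilized; on a 4-core machine, 1.0 is roughly 25% of capacity. On Linux the figure also counts tasks blocked on disk I/O, so a high load does not always mean busy CPUs. This comes up in interviews.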
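The per-core arithmetic can be checked on any Linux box with a few lines of Python. This is a minimal sketch: `load_per_core` is a hypothetical helper name, and it treats load divided by core count as a rough utilization figure, which is only an approximation.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by core count (1.0 means every core is busy)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages (Unix only)
    averages = os.getloadavg()
    cores = os.cpu_count() or 1
    for label, value in zip(("1m", "5m", "15m"), averages):
        share = load_per_core(value, cores)
        print(f"{label}: load {value:.2f} is about {share:.0%} of {cores} core(s)")
```

A normalized value that stays above 1.0 means work is queueing faster than the machine can run it.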
Disk:
```bash
df -h # filesystem usage
du -sh /var/* # directory sizes
lsblk # block devices
fdisk -l # partition table
```
LVM (you may not have LVM on the VPS, but understand the commands):
```bash
pvs # physical volumes
vgs # volume groups
lvs # logical volumes
```
Log analysis:
```bash
journalctl --since "2025-01-01" --until "2025-01-02"
journalctl -u nginx --no-pager | grep "502" | wc -l
grep -E "ERROR|WARN" /var/log/syslog | tail -50
zcat /var/log/syslog.2.gz | grep ERROR # compressed log files
```
---
**Sunday: Git infra conventions + FactorSphere documentation**
This is the most interview-impactful work of the week.
Create `docs/` in the FactorSphere repo. Write two files:
**`ARCHITECTURE.md`**: cover why Cloudflare Workers instead of a traditional server (latency, no cold starts at edge, cost at zero users); how requests flow (DNS → Cloudflare edge → Worker → Pinecone/LLM → response); the data flow for the ranking pipeline (source aggregation → processing Workers → Pinecone indexing → query Workers); why Pinecone over a hosted Postgres vector extension (managed, no infra to maintain); how the frontend on Cloudflare Pages is decoupled from the Worker API; and what the tradeoffs are (no persistent state in Workers, Pinecone vendor lock-in, cold start behavior). Use a Mermaid diagram:
```mermaid
graph LR
    User --> CF_Edge[Cloudflare Edge]
    CF_Edge --> Worker_API[Workers API]
    Worker_API --> Pinecone[(Pinecone Vector DB)]
    Worker_API --> LLM[LLM Inference]
    CF_Pages[Cloudflare Pages] --> Worker_API
```
**`CICD.md`**: cover what triggers a deploy (push to `main`); the GitHub Actions workflow steps (lint → type check → Wrangler deploy); what `wrangler deploy` actually does (bundles the Worker, uploads to Cloudflare's edge network); how Cloudflare Pages handles frontend deploys automatically (build hook on push); what happens on failure (GitHub Actions marks the workflow run as failed and Wrangler does not promote the broken version, so the previous version stays live); how to roll back (revert commit + push, or `wrangler rollback` to a previous deployment ID); and which environment variables are injected at deploy time vs stored as Cloudflare secrets.
Write this so you can recite it verbally in a 3-minute interview answer. That's the test.
**Infra repo conventions:**
- READMEs in infrastructure repos answer: what this runs, how to run it, what environment variables it needs, what the architecture looks like, and what decisions were made and why
- Application repos explain features; infrastructure repos explain operations
- Commit messages in infra repos are imperative and specific: `fix(nginx): increase worker_connections to handle spike load` not `update config`
- Commit `.env.example` with all variable names and no values. Never commit `.env`.
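The commit-message convention above is easy to check mechanically, for example from a pre-commit hook. The sketch below is one possible check, not a standard tool: the list of allowed types and the name `is_structured_commit` are assumptions to adapt to your own convention.

```python
import re

# Hypothetical allowed types; extend to match your team's convention.
COMMIT_RE = re.compile(r"^(feat|fix|refactor|docs|chore)\(([a-z0-9-]+)\): (\S.*)$")


def is_structured_commit(subject: str) -> bool:
    """Return True if the subject line looks like 'type(scope): summary'."""
    return COMMIT_RE.match(subject) is not None


if __name__ == "__main__":
    for subject in (
        "fix(nginx): increase worker_connections to handle spike load",
        "update config",
    ):
        verdict = "ok" if is_structured_commit(subject) else "rejected"
        print(f"{verdict}: {subject}")
```

The scope pattern accepts lowercase letters, digits, and hyphens, which covers scopes such as `deploy-check` or `vps-reporter`.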
**Deliverables, Week 1:**
- Hetzner VPS: non-root `deploy` user, SSH key-only auth, UFW configured, `healthapi` systemd service running
- GitHub repo `ops-scripts`: `health-check.sh`, `backup.sh`, `log-analyzer.sh` with meaningful commits and a README
- FactorSphere repo: `docs/ARCHITECTURE.md` and `docs/CICD.md`, professional and interview-ready
---
### Week 2: Networking (Practical)
**What this week unlocks:** You can answer every networking question in a junior DevOps technical round. DNS debugging, HTTP troubleshooting, SSH advanced usage, and firewall diagnosis are the most common technical screen topics. This week makes you competent at all of them.
**Study hours: 42**
**Learning objectives:**
- DNS: resolution chain, record types, TTL, cache behavior, practical debugging tools
- HTTP: headers, status codes, TLS handshake, what curl reveals
- TCP/IP: three-way handshake, port states, socket inspection
- SSH: config file, tunneling, port forwarding, agent forwarding
- Firewalls: UFW rule management, iptables literacy, nftables awareness
- tcpdump for real diagnosis
---
**Monday-Tuesday: DNS**
Install: `apt install dnsutils` on the VPS if not present.
```bash
dig factorsphere.org # full answer section
dig +short factorsphere.org # just the IP
dig +trace factorsphere.org # full resolution chain from root
dig @8.8.8.8 factorsphere.org # force a specific resolver
dig @1.1.1.1 factorsphere.org # Cloudflare resolver
dig MX factorsphere.org # mail records
dig TXT factorsphere.org # SPF, DKIM, verification records
dig CNAME www.factorsphere.org # canonical name
dig -x <ip-address> # reverse lookup
```
Read the AUTHORITY SECTION and ADDITIONAL SECTION in `dig` output. Understand what TTL means: resolvers cache a record for up to its TTL, so after you change a DNS record, clients behind a resolver that cached the old value keep getting it until that cache entry expires. This is how to answer "how long does DNS propagation take?"
Watch DNS queries in real time:
```bash
tcpdump -i any port 53 -n
# Open another terminal and run: dig google.com
# Watch the query and response packets appear
```
Understand `/etc/resolv.conf` (which nameserver your system queries) and `/etc/hosts` (local override, checked before DNS). Add an entry to `/etc/hosts` that maps a fake hostname to localhost, verify it resolves, then remove it.
Configure a real subdomain if you have a domain: point `vps.yourdomain.com` to your Hetzner IP as an A record. Verify with `dig`. This demonstrates you've actually managed DNS, not just read about it.
---
**Wednesday: HTTP in depth**
```bash
curl -v https://factorsphere.org # verbose: see TLS handshake, headers, body
curl -I https://factorsphere.org # HEAD request only, no body
curl -L https://factorsphere.org # follow redirects
curl -X POST https://api.example.com/endpoint \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{"key": "value"}'
curl -o /dev/null -s -w "%{http_code}\n" https://factorsphere.org # just status code
```
Read every header in the `curl -v` output. Know what these mean:
- `Cache-Control: max-age=86400`: the response may be cached for 24 hours
- `X-Forwarded-For`: original client IP when behind a proxy
- `Strict-Transport-Security`: tells the browser to use HTTPS only for this host on future requests
- `CF-RAY`: Cloudflare request ID, useful for debugging Workers
- `Content-Encoding: gzip`: response is compressed
HTTP status codes you must know cold: 200, 201, 204, 301, 302, 304, 400, 401, 403, 404, 422, 429, 500, 502, 503, 504. Know the difference between 401 and 403, between 502 and 503.
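To drill the pairs that are easy to confuse, a small lookup helper can be useful. This is a sketch only: the one-line explanations in `NOTES` are my own ops-oriented summaries and `describe` is a hypothetical name, while the reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus

# Short notes for the commonly confused codes (wording is a summary, not spec text).
NOTES = {
    401: "no valid credentials were presented; authenticate and retry",
    403: "the server understood who you are but refuses the request",
    502: "a proxy or gateway got an invalid response from the upstream",
    503: "the service cannot handle the request right now (overload or maintenance)",
    504: "a proxy or gateway timed out waiting for the upstream",
}


def describe(code: int) -> str:
    """Return 'code phrase' plus a short note where one is defined."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    note = NOTES.get(code)
    return f"{code} {phrase}" + (f": {note}" if note else "")


if __name__ == "__main__":
    for code in (401, 403, 502, 503, 504):
        print(describe(code))
```

In a reverse-proxy setup, 502 usually points at the upstream process (crashed or answering garbage), while 503 points at capacity or a deliberate maintenance state.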
TLS handshake: be able to describe client hello → server hello + certificate → client verifies cert → key exchange → symmetric session established. Not cryptography depth, but the sequence.
```bash
openssl s_client -connect factorsphere.org:443 -servername factorsphere.org </dev/null # inspect the TLS certificate
openssl x509 -in cert.pem -noout -dates # check expiry of a certificate saved as cert.pem
```
---
**Thursday: SSH advanced**
`~/.ssh/config` β create this file:
```
Host vps
    HostName <your-vps-ip>
    User deploy
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60

Host github.com
    IdentityFile ~/.ssh/id_ed25519_github
    User git
```
Now `ssh vps` instead of `ssh -i ~/.ssh/id_ed25519 deploy@<ip>`.
Local port forwarding (access a service on the VPS that isn't exposed publicly):
```bash
ssh -L 5432:localhost:5432 vps
# Now psql -h localhost -p 5432 connects to Postgres on the VPS
```
Remote port forwarding (expose a local service through the VPS, useful for demos):
```bash
ssh -R 8080:localhost:3000 vps
# Port 8080 on the VPS now forwards to port 3000 on your local machine
# (bound to the VPS loopback only unless GatewayPorts is enabled in sshd_config)
```
ProxyJump (hop through a bastion):
```bash
ssh -J bastion.example.com internal-server.example.com
# Or in config:
# Host internal
# ProxyJump bastion
```
SSH agent:
```bash
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
ssh-add -l # list loaded keys
```
---
**Friday-Saturday: Firewalls and network diagnostics**
UFW advanced:
```bash
ufw status numbered # numbered rules for deletion
ufw delete 3 # delete rule 3
ufw allow from 10.0.0.0/8 to any port 5432 # Postgres from internal network only
ufw logging on
tail -f /var/log/ufw.log
```
iptables (you need to read it, not memorize it):
```bash
iptables -L -n -v # list all rules with packet/byte counts
iptables -L INPUT -n -v # just INPUT chain
iptables -t nat -L -n # NAT table
```
UFW uses iptables underneath. When UFW allows port 80, it adds an iptables ACCEPT rule. This is how to answer "how does UFW work under the hood?"
nftables:
```bash
nft list ruleset # view current rules
```
Network diagnostics:
```bash
ping -c 4 8.8.8.8 # basic reachability
traceroute 8.8.8.8 # hop-by-hop path
mtr 8.8.8.8 # traceroute + ping combined, live
ss -s # socket statistics summary
ss -tnp state established # established TCP connections
```
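The port states from this week's objectives can also be probed from Python, which helps when `ss` is only available on the server side. A minimal sketch, assuming a hypothetical helper `port_state`: a completed TCP handshake means open, a refused connection means nothing is listening, and a timeout usually means a firewall is silently dropping packets.

```python
import socket


def port_state(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port as open, closed, or filtered from the client side."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # three-way handshake completed
    except ConnectionRefusedError:
        return "closed"          # host answered with RST: no listener
    except socket.timeout:
        return "filtered"        # no answer at all: likely dropped by a firewall
    except OSError as exc:
        return f"error: {exc}"   # DNS failure, unreachable network, etc.


if __name__ == "__main__":
    for port in (22, 80, 8080):
        print(f"localhost:{port} is {port_state('localhost', port)}")
```

The difference between closed and filtered is the useful part: a UFW `deny` rule drops packets and shows up as a timeout, whereas a stopped service on an allowed port is refused immediately.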
---
**Sunday: Build the networking diagnostics script**
This goes on GitHub as a real project.
`endpoint-monitor.py` (check a list of endpoints and report health):
```python
#!/usr/bin/env python3
"""
Endpoint health monitor: checks DNS, HTTP reachability, and SSL cert expiry.
Usage: python3 endpoint-monitor.py --config endpoints.yaml [--json]
"""
import argparse, json, socket, ssl, sys, time
import urllib.request, urllib.error
from urllib.parse import urlparse

import yaml  # PyYAML


def check_dns(hostname):
    try:
        ip = socket.gethostbyname(hostname)
        return {"status": "ok", "ip": ip}
    except socket.gaierror as e:
        return {"status": "error", "error": str(e)}


def check_http(url, timeout=10):
    start = time.monotonic()
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "endpoint-monitor/1.0"})
        with urllib.request.urlopen(req, timeout=timeout) as r:
            latency_ms = round((time.monotonic() - start) * 1000)
            return {"status": "ok", "http_code": r.status, "latency_ms": latency_ms}
    except urllib.error.HTTPError as e:
        return {"status": "error", "http_code": e.code}
    except Exception as e:
        return {"status": "error", "error": str(e)}


def check_ssl(hostname, port=443):
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as s:
                cert = s.getpeercert()
        expiry_str = cert["notAfter"]
        # cert_time_to_seconds parses the notAfter format into a UTC epoch timestamp
        days_left = int((ssl.cert_time_to_seconds(expiry_str) - time.time()) // 86400)
        return {"status": "ok", "expires": expiry_str, "days_remaining": days_left}
    except Exception as e:
        return {"status": "error", "error": str(e)}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args()
    with open(args.config) as f:
        config = yaml.safe_load(f)
    results = {}
    for endpoint in config.get("endpoints", []):
        name = endpoint["name"]
        url = endpoint["url"]
        parsed = urlparse(url)
        hostname = parsed.hostname  # drops scheme, port, and path
        results[name] = {
            "dns": check_dns(hostname),
            "http": check_http(url),
            "ssl": check_ssl(hostname, parsed.port or 443) if parsed.scheme == "https" else None,
        }
    if args.json:
        print(json.dumps(results, indent=2))
    else:
        for name, checks in results.items():
            print(f"\n{name}:")
            for check_name, result in checks.items():
                if result:
                    status = "✓" if result["status"] == "ok" else "✗"
                    print(f"  {status} {check_name}: {result}")
    any_failure = any(
        c["status"] == "error"
        for r in results.values()
        for c in r.values() if c
    )
    sys.exit(1 if any_failure else 0)


if __name__ == "__main__":
    main()
```
`endpoints.yaml`:
```yaml
endpoints:
  - name: FactorSphere
    url: https://factorsphere.org
  - name: FactorSphere API
    url: https://api.factorsphere.org
  - name: VPS
    url: http://<your-vps-ip>:8080
```
**Deliverables, Week 2:**
- `ops-scripts` repo updated: `endpoint-monitor.py` added with `endpoints.yaml.example` and updated README
- Can verbally answer in an interview: "Walk me through what happens when a user types factorsphere.org and hits Enter", from DNS query through TLS through Cloudflare edge to Worker response
- VPS subdomain configured (if you have a domain) and verified with `dig`
---
### Week 3: Python DevOps Scripting
**What this week unlocks:** Python automation is the most common take-home task format in NCR DevOps interviews. By the end of this week you have a GitHub repo with real scripts and can complete a take-home assignment in 2 hours rather than 4.
**Study hours: 42**
**Learning objectives:**
- `subprocess`: running shell commands from Python, capturing output, handling return codes
- `os`/`sys`: environment variables, path operations, argument handling
- `argparse`: building proper CLI tools with flags and help text
- `requests`: HTTP calls, error handling, timeouts, retries
- `yaml`/`json`: config parsing, output generation
- Writing scripts that do real infrastructure work: deploy checks, API monitors, config validators, log parsers
---
**Monday-Tuesday: subprocess + os + sys + argparse**
`subprocess`, the right way:
```python
import subprocess

# Run a command, capture output, check return code
result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    capture_output=True,
    text=True,
    timeout=10,
)
print(result.stdout.strip())  # "active" or "inactive"
print(result.returncode)      # 0 = active, 3 = inactive

# Run shell command (avoid when possible: harder to escape safely)
result = subprocess.run(
    "df -h | grep '/$'",
    shell=True, capture_output=True, text=True,
)

# Raise on non-zero exit
subprocess.run(["docker", "ps"], check=True)  # raises CalledProcessError if docker fails
```
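In real scripts the same three failure modes recur: the binary is missing, the command hangs, or it exits non-zero. One way to handle them in a single place is a small wrapper like the sketch below. `run_cmd` is a hypothetical helper, and the 127 and 124 codes are borrowed from shell conventions (command not found, and the `timeout` utility) rather than anything `subprocess` defines.

```python
import subprocess
import sys


def run_cmd(args, timeout=10):
    """Run a command without a shell; return (returncode, stdout, stderr)."""
    try:
        result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return 127, "", f"command not found: {args[0]}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return 124, "", f"timed out after {timeout}s"
    return result.returncode, result.stdout.strip(), result.stderr.strip()


if __name__ == "__main__":
    # sys.executable is used so the demo runs anywhere Python does
    code, out, err = run_cmd([sys.executable, "-c", "print('hello from a child process')"])
    print(code, out, err)
```

Passing the command as a list keeps arguments from being reinterpreted by a shell, which is why `shell=True` is the exception rather than the default.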
`os` and `sys`:
```python
import os, sys
# Environment variables
api_key = os.environ.get("CF_API_KEY") # returns None if not set, no KeyError
api_key = os.environ["CF_API_KEY"] # raises KeyError if not set; use when required
# Paths
os.path.join("/var/log", "nginx", "access.log")
os.path.exists("/etc/nginx/nginx.conf")
os.path.abspath("../config")
# Script directory (useful for loading config files relative to script)
script_dir = os.path.dirname(os.path.abspath(__file__))
config_path = os.path.join(script_dir, "config.yaml")
# Exit with status code (important for shell scripts that call your Python)
sys.exit(0) # success
sys.exit(1) # failure
```
`argparse` (build a real CLI):
```python
import argparse


def main():
    parser = argparse.ArgumentParser(
        description="Check service health on a VPS",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--host", required=True, help="VPS hostname or IP")
    parser.add_argument("--port", type=int, default=8080, help="Health endpoint port")
    parser.add_argument("--service", action="append", dest="services",
                        help="systemd service to check (repeat for multiple)")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--output-format", choices=["text", "json"], default="text")
    args = parser.parse_args()
    # args.host, args.port, args.services, args.verbose, args.output_format
```
Build `deploy-check.py`: it takes a service name as argument, checks that it is active, verifies the port is listening, hits the health endpoint, and prints pass/fail with an exit code:
```python
#!/usr/bin/env python3
"""
Post-deploy smoke test: checks systemd service, port, and HTTP health endpoint.
Usage: python3 deploy-check.py --service nginx --port 80 --endpoint /health
"""
import argparse, subprocess, socket, sys

import requests


def check_service(name):
    r = subprocess.run(["systemctl", "is-active", name],
                       capture_output=True, text=True)
    return r.stdout.strip() == "active"


def check_port(port, host="localhost"):
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:  # covers timeouts and refused connections
        return False


def check_http(url, timeout=10):
    try:
        r = requests.get(url, timeout=timeout)
        return r.status_code < 500, r.status_code
    except requests.exceptions.RequestException as e:
        return False, str(e)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--service", required=True)
    parser.add_argument("--port", type=int, required=True)
    parser.add_argument("--endpoint", default="/health")
    parser.add_argument("--host", default="localhost")
    args = parser.parse_args()
    checks = [
        ("service active", check_service(args.service)),
        ("port listening", check_port(args.port, args.host)),
    ]
    http_ok, http_detail = check_http(f"http://{args.host}:{args.port}{args.endpoint}")
    # Include the status code (or the error text) so a failure is self-explanatory
    checks.append((f"HTTP {args.endpoint} ({http_detail})", http_ok))
    passed = all(ok for _, ok in checks)
    for name, ok in checks:
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {name}")
    sys.exit(0 if passed else 1)


if __name__ == "__main__":
    main()
```
---
**Wednesday: requests + API interaction**
```python
import os

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Basic
r = requests.get("https://api.example.com/status", timeout=10)
r.raise_for_status()  # raises HTTPError for 4xx/5xx
data = r.json()

# With headers
r = requests.get(
    "https://api.cloudflare.com/client/v4/accounts",
    headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
    timeout=10,
)

# Retry logic
session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
r = session.get("https://api.example.com", timeout=10)
```
Build `cloudflare-deploy-monitor.py`: it checks the status of the most recent FactorSphere Workers deployment via the Cloudflare API. Reads `CF_API_TOKEN` and `CF_ACCOUNT_ID` from environment variables. Outputs the deployment status and timestamp. Returns exit code 1 if the last deployment failed. This is a real script that interacts with a real production system.
Cloudflare API endpoint to use: `GET /accounts/{account_id}/workers/scripts/{script_name}/deployments` (check current CF API docs for the exact path; the concept is what matters here, and the implementation requires your actual credentials).
---
**Thursday: YAML/JSON parsing**
```python
import yaml, json

# YAML
with open("config.yaml") as f:
    config = yaml.safe_load(f)  # safe_load, not load: avoids arbitrary code execution

# JSON
with open("output.json") as f:
    data = json.load(f)

# Pretty print
print(json.dumps(data, indent=2, default=str))  # default=str handles datetime objects

# Write YAML
with open("generated-config.yaml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)
```
Build `config-validator.py`: it reads a YAML deployment config, validates that required keys exist and have the right types, outputs errors if invalid, and exits 0 if valid:
```python
REQUIRED_FIELDS = {
"service_name": str,
"image": str,
"port": int,
"health_endpoint": str,
"environment": dict,
}
```
This is a real pattern: CI pipelines often validate config before proceeding.
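One possible shape for the validation core is sketched below. It reuses the `REQUIRED_FIELDS` table from above; the function name `validate` and the error wording are assumptions. Loading the file with `yaml.safe_load` and turning a non-empty error list into exit code 1 is left to the CLI wrapper.

```python
REQUIRED_FIELDS = {
    "service_name": str,
    "image": str,
    "port": int,
    "health_endpoint": str,
    "environment": dict,
}


def validate(config):
    """Return a list of human-readable errors; an empty list means valid."""
    if not isinstance(config, dict):
        return ["top level of the config must be a mapping"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"missing required field: {field}")
            continue
        value = config[field]
        # bool is a subclass of int in Python, so reject True/False for int fields
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


if __name__ == "__main__":
    sample = {"service_name": "api", "image": "myapi:latest", "port": "5000"}
    for error in validate(sample):
        print(f"ERROR: {error}")
```

Collecting every error before exiting, instead of stopping at the first one, saves a round trip through CI for each mistake in the file.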
---
**Friday-Saturday: Full ops script**
Build `vps-health-reporter.py`, the Week 3 anchor deliverable. It's a single script that does everything:
```
Usage: python3 vps-health-reporter.py [--json] [--verbose] [--output FILE]

Checks:
  - Disk usage per filesystem (warns if > configurable threshold)
  - Memory usage
  - systemd services (configured list)
  - Port reachability (configured list)
  - HTTP endpoints (configured list with expected status codes)
  - SSL cert expiry for HTTPS endpoints (warns if < 30 days)

Output:
  - Text table by default
  - JSON with --json flag
  - Writes to file with --output flag
  - Exit code 0 if all checks pass, 1 if any fail

Config: reads from vps-health-reporter.yaml:
  services:
    - nginx
    - healthapi
    - docker
  ports:
    - host: localhost
      port: 80
      name: nginx-http
    - host: localhost
      port: 8080
      name: healthapi
  endpoints:
    - url: http://localhost:8080
      expected_status: 200
      name: healthapi-root
    - url: https://factorsphere.org
      expected_status: 200
      name: factorsphere
  disk_threshold_pct: 80
  memory_threshold_pct: 90
  ssl_warning_days: 30
```
This script uses `subprocess`, `requests`, `ssl`, `socket`, `yaml`, `json`, `argparse`, `sys`, `os`. It solves a real problem: copy it to any VPS and get a health report.
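A reasonable way to structure the reporter is one function per check, each returning a small dict, with the exit code derived from the collected results. The sketch below shows only the disk check under that assumption. The names `check_disk` and `overall_exit_code` and the result keys are hypothetical, and it uses the standard library's `shutil.disk_usage` instead of parsing `df`.

```python
import shutil


def check_disk(path="/", threshold_pct=80):
    """Report usage of the filesystem containing path against a threshold."""
    usage = shutil.disk_usage(path)
    # used/total; df divides by used+available instead, so figures can differ slightly
    used_pct = round(usage.used / usage.total * 100, 1)
    status = "ok" if used_pct <= threshold_pct else "warn"
    return {"name": f"disk {path}", "status": status, "used_pct": used_pct}


def overall_exit_code(results):
    """0 if every check is ok, 1 otherwise (matches the spec above)."""
    return 0 if all(r["status"] == "ok" for r in results) else 1


if __name__ == "__main__":
    results = [check_disk("/", threshold_pct=80)]
    for r in results:
        print(f"[{r['status'].upper()}] {r['name']}: {r['used_pct']}% used")
    print("exit code would be", overall_exit_code(results))
```

Keeping every check's return value in the same shape makes the `--json` flag trivial: the results list can be passed straight to `json.dumps`.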
---
**Sunday: Polish and documentation**
`requirements.txt`:
```
requests==2.31.0
PyYAML==6.0.1
```
Meaningful commit history: if all your commits are `add scripts`, you're doing it wrong. Examples of correct commit messages:
```
feat(deploy-check): add HTTP health endpoint validation
fix(endpoint-monitor): handle SSL cert expiry for non-HTTPS endpoints gracefully
feat(vps-reporter): add configurable disk/memory thresholds from YAML config
refactor(cloudflare-monitor): extract API client to reusable class
```
README for the repo: what problem each script solves, prerequisites, installation (`pip install -r requirements.txt`), example usage for each script, example output.
**Deliverables, Week 3:**
- `ops-scripts` repo: 8 scripts (`health-check.sh`, `backup.sh`, `log-analyzer.sh`, `endpoint-monitor.py`, `deploy-check.py`, `cloudflare-deploy-monitor.py`, `config-validator.py`, `vps-health-reporter.py`), `requirements.txt`, `endpoints.yaml.example`, `vps-health-reporter.yaml.example`, clean README, 20+ meaningful commits
---
### Week 4: Docker (Production) + Docker Compose + Application Strategy
**What this week unlocks:** Docker and Compose are tested in almost every junior DevOps interview in NCR. By the end of this week you have a real multi-service stack running on the VPS, something you can show and explain in a technical round. Sunday is the application strategy session.
**Study hours: 42**
**Learning objectives:**
- Multi-stage Dockerfiles: how and why, not just what
- Image optimization: layer caching order, `.dockerignore`, minimal base images
- Container networking: how containers resolve each other by name in Compose
- Volume management: named volumes vs bind mounts, when each is appropriate
- Health checks: HEALTHCHECK instruction, container health states, `depends_on` conditions
- Running Docker Compose stacks as persistent services on VPS
- Docker Compose: full multi-service stack, `.env` files, override files, Makefile
---
**Monday-Tuesday: Production Dockerfile patterns**
You've used Docker. These are the patterns that distinguish junior from intern-level usage.
Multi-stage build for a Python app:
```dockerfile
# syntax=docker/dockerfile:1
# --- Stage 1: builder ---
FROM python:3.11-slim AS builder
WORKDIR /build
# Copy only requirements first so Docker can cache this layer.
# If requirements.txt doesn't change, this layer is reused on rebuild.
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# --- Stage 2: runtime ---
FROM python:3.11-slim AS runtime
# Create non-root user
RUN useradd --create-home --shell /bin/bash appuser
WORKDIR /app
# Copy only the installed packages from builder, owned by the runtime user
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .
USER appuser
# PATH must include user-installed packages
ENV PATH=/home/appuser/.local/bin:$PATH
EXPOSE 5000
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')" || exit 1
# gunicorn must be listed in requirements.txt
CMD ["python", "-m", "gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "app:app"]
```
`.dockerignore`:
```
.git
.gitignore
__pycache__
*.pyc
*.pyo
.pytest_cache
.env
*.env
node_modules
.venv
dist
build
*.egg-info
README.md
docs/
```
Why layer caching order matters: `COPY requirements.txt .` + `RUN pip install` before `COPY . .` means requirements are cached as long as `requirements.txt` doesn't change. If you `COPY . .` first, every file change invalidates the pip install layer. Run `docker build` twice; the second run should show `CACHED` for the pip layer.
Compare image sizes:
```bash
docker images | grep myapp
# naive (python:3.11): ~1.1GB
# multi-stage (python:3.11-slim): ~200MB
# distroless: ~100MB
```
Container networking:
```bash
docker network create mynet
docker run -d --name db --network mynet postgres:15
docker run -it --network mynet alpine ping db # resolves by container name
docker inspect mynet # see connected containers and IP assignments
```
---
**Wednesday: Docker Compose**
Build a `docker-compose.yml` for a real multi-service stack:
```yaml
version: "3.9"
services:
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    image: myapi:latest
    container_name: myapi
    ports:
      - "5000:5000"
    environment:
      - DATABASE_URL=postgresql://postgres:${POSTGRES_PASSWORD}@db:5432/appdb
      - REDIS_URL=redis://cache:6379/0
      - SECRET_KEY=${SECRET_KEY}
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    restart: unless-stopped
    networks:
      - backend
  db:
    image: postgres:15-alpine
    container_name: mydb
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    networks:
      - backend
  cache:
    image: redis:7-alpine
    container_name: myredis
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    restart: unless-stopped
    networks:
      - backend
volumes:
  postgres_data:
  redis_data:
networks:
  backend:
    driver: bridge
```
`.env` (never commit this; commit `.env.example` instead):
```
POSTGRES_PASSWORD=changeme_in_production
SECRET_KEY=changeme_in_production
```
Override for development: `docker-compose.override.yml` (Compose merges it automatically whenever it sits next to `docker-compose.yml`):
```yaml
services:
  api:
    volumes:
      - ./api:/app # bind mount for hot reload in dev
    environment:
      - DEBUG=true
```
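Keep the override file off the production host, or run `docker compose -f docker-compose.yml up -d` there, which skips it. Either way, the bind mounts and debug settings don't exist in prod.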
---
**Thursday: Containers as systemd services + resource limits**
Running Docker Compose as a systemd service on the VPS:
`/etc/systemd/system/myapp.service`:
```ini
[Unit]
Description=MyApp Docker Compose Stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=120
TimeoutStopSec=30
User=deploy
[Install]
WantedBy=multi-user.target
```
Resource limits in Compose:
```yaml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
        reservations:
          memory: 128M
```
Log management β prevent containers from filling your disk:
```yaml
services:
  api:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```
---
**Friday-Saturday: Build the anchor project (vps-multiservice)**
This is the Week 4 portfolio project. Build it properly.
**Project: Multi-service API stack on VPS**
Repository: `vps-multiservice`
Structure:
```
vps-multiservice/
├── api/
│   ├── app.py
│   ├── Dockerfile
│   └── requirements.txt
├── db/
│   └── init.sql
├── nginx/                  # placeholder config; Week 5 replaces this
│   └── default.conf
├── docker-compose.yml
├── docker-compose.override.yml
├── .env.example
├── Makefile
└── README.md
```
`api/app.py` (a real Flask API, not hello-world):
```python
from flask import Flask, jsonify
import psycopg2, redis, os, time

app = Flask(__name__)
START_TIME = time.time()  # recorded at import so /info can report real uptime


def get_db():
    return psycopg2.connect(os.environ["DATABASE_URL"])


def get_redis():
    url = os.environ.get("REDIS_URL", "redis://cache:6379/0")
    return redis.from_url(url)


@app.route("/health")
def health():
    checks = {}
    # Check Postgres
    try:
        conn = get_db()
        conn.close()
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
    # Check Redis
    try:
        r = get_redis()
        r.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"error: {e}"
    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return jsonify({"status": "ok" if all_ok else "degraded", "checks": checks}), status_code


@app.route("/info")
def info():
    return jsonify({
        "service": "vps-multiservice-api",
        "version": os.environ.get("APP_VERSION", "dev"),
        "uptime": round(time.time() - START_TIME, 1),  # seconds since process start
    })


@app.route("/cache/set/<key>/<value>")
def cache_set(key, value):
    r = get_redis()
    r.setex(key, 300, value)  # expires after 300 seconds
    return jsonify({"stored": key})


@app.route("/cache/get/<key>")
def cache_get(key):
    r = get_redis()
    value = r.get(key)
    return jsonify({"key": key, "value": value.decode() if value else None})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```
`Makefile`:
```makefile
.PHONY: up down logs ps build restart health shell-api shell-db
up:
	docker compose up -d --build
down:
	docker compose down
logs:
	docker compose logs -f
ps:
	docker compose ps
build:
	docker compose build --no-cache
restart:
	docker compose restart
health:
	curl -s http://localhost:5000/health | python3 -m json.tool
shell-api:
	docker compose exec api /bin/bash
shell-db:
	docker compose exec db psql -U postgres -d appdb
```
`README.md` must answer: what this project is, what services it runs, how to start it locally, how to deploy to production, what the health endpoint checks, what environment variables are required, what the architecture looks like (diagram), and what decisions were made (why named volumes over bind mounts, why `service_healthy` condition on `depends_on`, why non-root user in Dockerfile).
Deploy to the Hetzner VPS: clone the repo, create `.env` from `.env.example`, `make up`, verify `make health` returns 200. Configure the systemd service so it starts on boot.
---
**Sunday: Application strategy**
This is the only Sunday in Phase 1 not dedicated to pure technical work. Block the full day.
**Resume β one page, PDF:**
Header: your name, email, GitHub URL, LinkedIn URL, location (Delhi NCR), phone.
Title: *Junior DevOps & Infrastructure Engineer*
Experience section:
```
Software Developer Intern, [Antler-backed venture studio] [dates]
• Shipped production SaaS MVPs across multiple projects; owned Docker-based
  deployment pipelines and CI/CD configuration for 3+ products
• Implemented GitHub Actions workflows for automated testing and deployment
• Worked across time zones with distributed team
```
Projects section (this is more important than the internship for a DevOps role):
```
FactorSphere (factorsphere.org): Production Edge Platform
• Live academic journal ranking platform with 4,000+ journals, real users,
  Google-indexed; won college project exhibition
• Serverless microservices on Cloudflare Workers (analogous to AWS Lambda +
  API Gateway), Pinecone vector database, LLM inference pipeline
• CI/CD: GitHub Actions → Wrangler → Cloudflare edge deployment
• Architecture: fully edge-native; documented in ARCHITECTURE.md on GitHub
VPS Multi-Service Stack (github.com/you/vps-multiservice)
• Multi-service API stack on Ubuntu VPS: Flask + Redis + PostgreSQL
• Multi-stage Docker builds, Docker Compose, named volumes, health checks
• systemd service management, UFW firewall configuration, SSH hardening
```
Skills section:
```
Infrastructure: Linux (Ubuntu/Arch), Docker, Docker Compose, systemd, UFW, SSH
Scripting: Python (subprocess, requests, argparse, YAML/JSON), Bash
CI/CD: GitHub Actions, Wrangler (Cloudflare)
Networking: DNS, HTTP/S, TCP/IP, SSL/TLS, reverse proxy concepts
Observability: log analysis (journalctl), endpoint monitoring
Platforms: Cloudflare Workers/Pages, Pinecone, Hetzner VPS
Version Control: Git (branching, rebasing, structured commits)
```
Do not list technologies you can't defend in a 5-minute conversation. If Kubernetes is not on your CV, a recruiter won't ask about it. If it is, they will.
**LinkedIn:**
- Headline: *Junior DevOps Engineer | Cloudflare Workers | Docker | Python | Delhi NCR*
- About section: 3 sentences. "CS grad with production experience at an Antler-backed venture studio. Built and deployed FactorSphere (factorsphere.org), a live platform running on a serverless edge architecture. Targeting junior DevOps and cloud infrastructure roles in Delhi NCR."
- Add all projects with links
- Enable "Open to Work" (visible to recruiters, not your network if you prefer)
**Job titles to search:**
- "DevOps Engineer" + fresher/junior/0-2 years
- "Cloud Support Engineer"
- "Infrastructure Engineer"
- "Site Reliability Engineer" (rare at fresher level but exists)
- "Systems Engineer" (often infrastructure work at service companies)
**Platforms, in priority order:**
1. LinkedIn: set job alerts for each title + Delhi, Gurgaon, Noida
2. Naukri.com: mandatory for NCR; service companies and GCCs post exclusively here
3. Instahyre: funded product companies
4. Wellfound (AngelList): funded startups
5. Company career pages directly: Nagarro, GlobalLogic, Publicis Sapient, Genpact, NIIT Technologies, Mphasis
**Outreach message template (LinkedIn, max 5 lines):**
> Hi [Name], I'm a CS grad with production experience shipping SaaS MVPs at an Antler-backed venture studio, including a live platform (factorsphere.org) running on a serverless edge architecture with Docker, CI/CD, and Python scripting across projects. I'm targeting junior DevOps roles in NCR and noticed [Company] works on [relevant tech or cloud platform from their JD]. Would it be appropriate to share my CV directly?
Send this to: DevOps leads, engineering managers, or HR at target companies. Not recruiters at staffing agencies (waste of time for this profile). Target 10 outreach messages in the first week of applying.
**By Sunday evening, completed:**
- Resume PDF finalized
- LinkedIn updated
- Naukri profile created with correct resume
- 5 job applications submitted
- 5 outreach messages sent on LinkedIn
**Deliverables, Week 4:**
- `vps-multiservice` repo: running on VPS, full README, Makefile, `.env.example`, meaningful commit history showing incremental development
- `ops-scripts` repo: polished with all Week 1–3 scripts, clean README
- `endpoint-monitor.py` and `vps-health-reporter.py` running and documented
- FactorSphere: `docs/ARCHITECTURE.md` and `docs/CICD.md` committed
- Resume PDF (one page)
- LinkedIn updated
- 5+ applications submitted, 5+ outreach messages sent
- Job alert set on LinkedIn and Naukri
---
## PHASE 2: Weeks 5–12 - Become Hireable by Stronger Companies
---
### Week 5 - Nginx
**What this week unlocks:** Reverse proxy knowledge is expected at junior level. SSL termination and upstream configuration come up in almost every DevOps technical round. You also get HTTPS on your VPS project, a visible signal of operational maturity.
**Study hours: 42**
**Learning objectives:** Virtual hosts, reverse proxy, SSL termination with Let's Encrypt, load balancing upstream blocks, rate limiting, security headers, static file serving
---
**Monday: Installation and configuration structure**
```bash
apt install nginx
systemctl enable nginx
systemctl start nginx
nginx -v
```
Configuration structure on Ubuntu:
- `/etc/nginx/nginx.conf`: main config; defines `worker_processes`, `events`, and the `http` block
- `/etc/nginx/sites-available/`: your server blocks live here
- `/etc/nginx/sites-enabled/`: symlinks to sites-available for active configs
- `nginx -t`: test config syntax; always run before `systemctl reload nginx`
- `systemctl reload nginx`: graceful reload (no dropped connections) vs `restart` (all connections dropped)
Read `/etc/nginx/nginx.conf`. Understand: `worker_processes auto` starts one worker per CPU core, `worker_connections 1024` caps simultaneous connections per worker, and the `include /etc/nginx/sites-enabled/*` line is what loads your server blocks.
---
**Tuesday–Wednesday: Reverse proxy + SSL**
Create `/etc/nginx/sites-available/myapp`:
```nginx
server {
    listen 80;
    server_name vps.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 30s;
        proxy_read_timeout 30s;
        proxy_buffering on;
    }

    location /health {
        proxy_pass http://127.0.0.1:5000/health;
        access_log off;  # don't pollute logs with health checks
    }
}
```
```bash
ln -s /etc/nginx/sites-available/myapp /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx
```
SSL with Certbot:
```bash
apt install certbot python3-certbot-nginx
certbot --nginx -d vps.yourdomain.com
# Follow prompts: enter email, agree to TOS, choose redirect HTTP→HTTPS
certbot renew --dry-run # verify auto-renewal works
```
After Certbot runs, inspect `/etc/nginx/sites-available/myapp`; Certbot modifies it in place. The config now has a `listen 443 ssl` block and an HTTP-to-HTTPS redirect. Read and understand what was added.
Auto-renewal is handled by a systemd timer installed by Certbot: `systemctl status certbot.timer`.
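To make the renewal window concrete, here is a minimal Python sketch that computes how many days remain on a certificate from its `notAfter` string (the format Python's `ssl` module reports). The names `days_until_expiry` and `needs_renewal` are hypothetical, and the 30-day threshold is an assumption for illustration; check the renewal settings of your installed Certbot.

```python
import ssl
import time

# Assumption for this sketch; confirm against your Certbot renewal configuration.
RENEW_THRESHOLD_DAYS = 30


def days_until_expiry(not_after, now=None):
    """Return whole days between `now` (epoch seconds) and a cert's notAfter time."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires_at - now) // 86400)


def needs_renewal(not_after, now=None, threshold_days=RENEW_THRESHOLD_DAYS):
    """True when fewer than `threshold_days` remain before expiry."""
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    sample = "Jan  5 09:34:43 2018 GMT"  # notAfter format used by ssl.getpeercert()
    pretend_now = ssl.cert_time_to_seconds(sample) - 45 * 86400
    print(days_until_expiry(sample, now=pretend_now))
    print(needs_renewal(sample, now=pretend_now))
```

A check like this fits naturally into `endpoint-monitor.py` as a second line of defence in case the timer silently stops running.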
---
**Thursday: Multiple virtual hosts + static files**
Three server blocks:
1. `docs.yourdomain.com`: serves static HTML files from `/var/www/docs/`
2. `api.yourdomain.com`: reverse proxy to Flask API on port 5000
3. Default catch-all: returns 444 (nginx closes the connection without sending a response)
Static file serving:
```nginx
server {
    listen 443 ssl;
    server_name docs.yourdomain.com;
    # (SSL certs added by certbot)
    root /var/www/docs;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    # Cache static assets
    location ~* \.(css|js|png|jpg|ico)$ {
        expires 30d;
        add_header Cache-Control "public, immutable";
    }
}
```
Default catch-all:
```nginx
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;
    # self-signed placeholder cert from Ubuntu's ssl-cert package (apt install ssl-cert)
    ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
    ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
    return 444;
}
```
---
**Friday–Saturday: Load balancing + rate limiting + security headers**
Upstream block for load balancing:
```nginx
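upstream api_backends {
    least_conn;                      # route to server with fewest active connections
    server 127.0.0.1:5000 weight=2;
    server 127.0.0.1:5001 weight=1;  # start a second Flask instance for this exercise
    keepalive 32;                    # idle upstream connections kept open per worker
}
server {
    # listen / server_name / SSL directives as in the reverse proxy config
    location / {
        proxy_pass http://api_backends;
        proxy_http_version 1.1;          # upstream keepalive needs HTTP/1.1 ...
        proxy_set_header Connection "";  # ... and a cleared Connection header
    }
}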
```
Rate limiting:
```nginx
# In http block (nginx.conf or a conf.d include):
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=health_limit:1m rate=60r/m;
# In server block:
location /api/ {
limit_req zone=api_limit burst=20 nodelay;
proxy_pass http://api_backends;
}
```
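The interaction between `rate`, `burst`, and `nodelay` is easier to reason about with a small simulation. The sketch below is a simplified model of the leaky-bucket accounting behind `limit_req` for a single client key, not nginx's actual code; the function name `simulate_limit_req` is hypothetical, and the model ignores millisecond rounding and zone eviction.

```python
def simulate_limit_req(arrival_times, rate_per_s, burst):
    """Return one True (passed) / False (rejected) decision per arrival time.

    Simplified model of `limit_req ... burst=N nodelay` for one client:
    each accepted request adds 1 to an "excess" counter that drains at
    `rate_per_s`; a request is rejected when the counter would exceed `burst`.
    """
    excess = None   # None until the first request from this client is seen
    last = 0.0      # time of the last accepted request
    decisions = []
    for t in arrival_times:
        if excess is None:
            excess, last = 0.0, t
            decisions.append(True)
            continue
        candidate = max(0.0, excess - rate_per_s * (t - last) + 1.0)
        if candidate > burst:
            decisions.append(False)  # rejected; state is left unchanged
        else:
            excess, last = candidate, t
            decisions.append(True)
    return decisions


if __name__ == "__main__":
    # 30 requests in the same instant, then one more a second later
    arrivals = [0.0] * 30 + [1.0]
    result = simulate_limit_req(arrivals, rate_per_s=10, burst=20)
    print(sum(result[:30]), "of 30 simultaneous requests passed")
    print("request at t=1.0 passed:", result[30])
```

Under this model, with `rate=10r/s burst=20`, a burst of 30 simultaneous requests lets 21 through (the first request plus 20 burst slots) and rejects the remaining 9. By the time a request arrives one second later, enough of the counter has drained for it to pass.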
Security headers:
```nginx
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
```
---
**Sunday: Update vps-multiservice**
Replace the placeholder Nginx container in Docker Compose with a reference to the host Nginx. Update the `vps-multiservice` `README.md` with a "Production Deployment" section showing the full stack:
- Architecture diagram showing: internet → Nginx (host, SSL) → Docker network → Flask API container
- SSL configuration documented
- How to reproduce (Certbot commands, nginx site config)
---
### Week 6 - GitHub Actions for VPS
**What this week unlocks:** CI/CD for VPS-hosted projects is the most visible resume signal. This pipeline demonstrates that your Docker Compose stack is managed like production infrastructure, not run manually. Combined with FactorSphere's existing CI/CD, you can now speak to two different CI/CD patterns.
**Study hours: 42**
---
**Monday–Tuesday: GitHub Actions syntax**
Create `.github/workflows/deploy.yml` in `vps-multiservice`:
```yaml
name: Deploy to VPS
on:
  push:
    branches: [main]
  workflow_dispatch:  # allow manual trigger from GitHub UI
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate docker-compose
        run: docker compose config
  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to VPS
        uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.VPS_HOST }}
          username: deploy
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            cd /opt/myapp
            git pull origin main
            docker compose pull
            docker compose up -d --build --remove-orphans
            docker system prune -f
      - name: Smoke test
        run: |
          sleep 10
          curl -f https://vps.yourdomain.com/health || exit 1
```
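The smoke test above waits a fixed 10 seconds and then tries once, which can fail on a slow container start. One possible alternative is a retry loop; the sketch below uses only the Python standard library. The names `http_ok` and `wait_until_healthy`, and the default attempt count and delay, are illustrative assumptions and not part of the workflow above.

```python
import sys
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # covers connection refused, timeouts, and non-2xx HTTP errors
        return False


def wait_until_healthy(check, attempts=10, delay=3.0, sleep=time.sleep):
    """Call `check()` up to `attempts` times; return the attempt number that
    succeeded, or None if every attempt failed."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    url = sys.argv[1]  # e.g. https://vps.yourdomain.com/health
    ok_on = wait_until_healthy(lambda: http_ok(url))
    print(f"healthy after {ok_on} attempt(s)" if ok_on else "service never became healthy")
    sys.exit(0 if ok_on else 1)
```

Saved as a script in the repo, it could replace the `sleep 10` / `curl -f` pair in the smoke-test step. The `check` and `sleep` parameters are injected so the retry logic can be unit-tested without a network.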
GitHub Secrets to configure (Settings → Secrets and variables → Actions):
- `VPS_HOST`: your VPS IP
- `VPS_SSH_KEY`: contents of a dedicated deploy private key (generate a new `ed25519` key specifically for GitHub Actions, add the public key to `~/.ssh/authorized_keys` on the VPS)
Never put the private key in the repo or hardcode it in the workflow.
---
**Wednesday–Thursday: Zero-downtime consideration + notifications**
The naive `docker compose up -d --build` restarts containers, causing brief downtime. For a portfolio project this is acceptable. Document this limitation in the README and explain what a production mitigation looks like (blue-green deployment, rolling update in Kubernetes, or a health-check grace period with a load balancer).
Add failure notification via GitHub's built-in email (no setup required; by default GitHub emails you when a workflow run fails).
Add a status badge to the README:
```markdown
![Deploy to VPS](https://github.com/you/vps-multiservice/actions/workflows/deploy.yml/badge.svg)
```
---
**Friday–Saturday: Makefile deployment tooling**
Add to the `Makefile`:
```makefile
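.PHONY: deploy rollback status logs-prod

deploy:
	@echo "Deploying to VPS..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git pull && docker compose up -d --build"

# Checks out the previous commit and rebuilds; this leaves a detached HEAD,
# so run `git checkout main` on the server before the next deploy.
rollback:
	@echo "Rolling back to previous commit..."
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && git checkout HEAD~1 && docker compose up -d --build"

status:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose ps && curl -s http://localhost:5000/health"

logs-prod:
	ssh deploy@$(VPS_HOST) "cd /opt/myapp && docker compose logs -f --tail=100"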
```
Usage: `make VPS_HOST=<ip> deploy`, or set `VPS_HOST` in a local `.env.make` (not committed) and load it with `-include .env.make` at the top of the Makefile.
---
**Sunday: FactorSphere CI/CD comparison**
Update `docs/CICD.md` in FactorSphere to add a comparison section:
```
## CI/CD Pattern Comparison
FactorSphere (Cloudflare Workers):
  Trigger:  push to main
  Pipeline: GitHub Actions → wrangler deploy → Cloudflare edge network
  Rollback: wrangler rollback <deployment-id> or git revert + push
  State:    stateless Workers, no server to manage

vps-multiservice (Docker Compose + SSH):
  Trigger:  push to main
  Pipeline: GitHub Actions → SSH → git pull → docker compose up
  Rollback: git checkout HEAD~1 + docker compose up (or tag-based rollback)
  State:    stateful services (Postgres data in named volume), must manage carefully
```
Being able to explain this comparison (why each pattern exists, what tradeoffs each makes) is a strong interview signal.
---
### Weeks 7–9 - AWS (EC2, IAM, S3, VPC, CloudWatch)
Three weeks of AWS. You are building toward a single coherent project: the same multi-service stack, now running on EC2, with IAM roles, S3 backups, VPC networking, and CloudWatch logging.
---
### Week 7 - EC2 + IAM
**What this week unlocks:** AWS is on your CV with real hands-on evidence. Most NCR product companies and GCCs require at least basic AWS. After this week you can pass a cloud support phone screen.
**Learning objectives:** EC2 launch and management, SSH key pairs, security groups, IAM users/roles/policies, principle of least privilege, AWS CLI, instance profiles
---
**Monday–Tuesday: AWS account + IAM baseline**
Create an AWS account (personal email, not institutional). Immediately:
1. Enable MFA on the root account
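2. Create a billing alarm: first enable "Receive Billing Alerts" in the Billing preferences and switch the console region to us-east-1 (billing metrics are published only there), then CloudWatch → Alarms → Create Alarm → Billing → Total Estimated Charge → threshold $5 → email notification. This protects against accidentally leaving expensive resources running.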
3. Create an IAM user for yourself: `yourname-admin`, `AdministratorAccess` policy, programmatic + console access, enable MFA
4. Never use root again
Configure AWS CLI:
```bash
aws configure
# AWS Access Key ID: [from IAM user]
# AWS Secret Access Key: [from IAM user]
# Default region: ap-south-1 (Mumbai; lowest latency from Delhi)
# Default output format: json
aws sts get-caller-identity # verify who you're authenticated as
aws iam get-user # verify correct user
```
Understand these three IAM concepts cold:
- **User**: a person or application with long-term credentials (access keys)
- **Role**: an identity assumed temporarily; no long-term credentials; assumed by EC2 instances, Lambda, other services
- **Policy**: a JSON document that defines what actions are allowed on which resources
Create a least-privilege policy:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::my-specific-bucket",
"arn:aws:s3:::my-specific-bucket/*"
]
}
]
}
```
This is more secure than `"Action": "s3:*"` and more secure than `"Resource": "*"`. Be ready to explain why.
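One way to practise the explanation is to evaluate a few requests against both policies by hand. The sketch below is a deliberately simplified matcher, not the real IAM evaluation logic: it handles a single `Allow` statement with `*` wildcards and ignores `Deny`, conditions, and per-action resource types. The function name `statement_allows` and the sample object key are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM accepts either a string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def statement_allows(statement, action, resource):
    """Return True if one Allow statement covers `action` on `resource`."""
    if statement.get("Effect") != "Allow":
        return False
    # Action names are compared case-insensitively here; ARNs are not.
    action_ok = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_ok = any(
        fnmatchcase(resource, pattern)
        for pattern in _as_list(statement["Resource"])
    )
    return action_ok and resource_ok


SCOPED = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
        "arn:aws:s3:::my-specific-bucket",
        "arn:aws:s3:::my-specific-bucket/*",
    ],
}
BROAD = {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}

if __name__ == "__main__":
    key = "arn:aws:s3:::my-specific-bucket/report.csv"  # hypothetical object
    other = "arn:aws:s3:::someone-elses-bucket/secret.csv"
    for name, stmt in (("scoped", SCOPED), ("broad", BROAD)):
        print(name, "read own object:   ", statement_allows(stmt, "s3:GetObject", key))
        print(name, "delete own object: ", statement_allows(stmt, "s3:DeleteObject", key))
        print(name, "read other bucket: ", statement_allows(stmt, "s3:GetObject", other))
```

The scoped statement permits only the first request; the broad one permits all three. That difference is the blast radius if the credentials leak.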
---
**Wednesday–Thursday: EC2 launch and management**
Launch a `t2.micro` (free tier) with Ubuntu 22.04 in ap-south-1:
- AMI: Ubuntu 22.04 LTS (search "ubuntu 22.04 hvm" in Community AMIs or use the Quick Start)
- Instance type: t2.micro (free tier eligible)
- Key pair: create a new key pair named `ec2-deploy-key`, download the `.pem`
- Security group: create `web-sg`
  - Inbound: SSH (22) from your IP only (not 0.0.0.0/0), HTTP (80) from 0.0.0.0/0, HTTPS (443) from 0.0.0.0/0
  - Outbound: all traffic (default)
- Storage: 8GB gp3 (free tier)
SSH in:
```bash
chmod 400 ec2-deploy-key.pem
ssh -i ec2-deploy-key.pem ubuntu@<ec2-public-ip>
```
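Install Docker on the EC2 instance (you are logged in as `ubuntu`, so use `sudo`):
```bash
sudo apt update
# docker-compose-v2 is the Compose plugin in Ubuntu's own repositories;
# docker-compose-plugin is the equivalent package from Docker's apt repository
sudo apt install -y docker.io docker-compose-v2
sudo usermod -aG docker ubuntu   # log out and back in for the group change to apply
sudo systemctl enable docker
sudo systemctl start docker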
```
User data bootstrap: relaunch with a user data script that installs Docker automatically. In the EC2 launch wizard, under "Advanced details" → "User data":
```bash
#!/bin/bash
# User data runs as root on first boot, so no sudo is needed.
apt-get update -y
# docker-compose-v2 comes from Ubuntu's repositories; use docker-compose-plugin
# instead if you have added Docker's own apt repository
apt-get install -y docker.io docker-compose-v2 git
usermod -aG docker ubuntu
systemctl enable docker
systemctl start docker
```
This is how you automate instance bootstrap, a core EC2 concept.
---
**Friday–Saturday: IAM role + instance profile**
Create an IAM role for EC2 to access S3:
1. IAM → Roles → Create role
2. Trusted entity: AWS service → EC2
3. Policy: create inline policy:
```json
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::your-backup-bucket",
"arn:aws:s3:::your-backup-bucket/*"
]
}]
}
```
4. Name the role `ec2-app-role`
5. Attach the role to your EC2 instance: EC2 → Actions → Security → Modify IAM role
Now on the EC2 instance, with no credentials configured (install the AWS CLI first if the image does not include it):
```bash
aws s3 ls s3://your-backup-bucket # works because instance has IAM role
aws sts get-caller-identity # shows the assumed-role identity
```
This is the correct way to give EC2 access to AWS services, rather than putting access keys on the instance. Be ready to explain why: if the instance is compromised, long-term access keys stored on it can be exfiltrated and reused indefinitely, whereas role credentials are temporary, rotated automatically, and scoped to the role's policy.
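The `Arn` field printed by `aws sts get-caller-identity` tells you which kind of identity you are using. As a rough illustration, the sketch below splits such an ARN into its parts; the helper name `parse_caller_arn`, the account ID, and the instance ID are made up for the example.

```python
def parse_caller_arn(arn):
    """Split an STS caller ARN into account, kind, and name parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    account_id = parts[4]
    kind, _, rest = parts[5].partition("/")
    if kind == "assumed-role":
        # assumed-role/<role-name>/<session-name>; on EC2 the session name
        # is the instance ID
        role_name, _, session = rest.partition("/")
        return {"account": account_id, "kind": kind, "role": role_name, "session": session}
    return {"account": account_id, "kind": kind, "name": rest}


if __name__ == "__main__":
    # Illustrative values only
    on_instance = "arn:aws:sts::123456789012:assumed-role/ec2-app-role/i-0123456789abcdef0"
    on_laptop = "arn:aws:iam::123456789012:user/yourname-admin"
    print(parse_caller_arn(on_instance))
    print(parse_caller_arn(on_laptop))
```

Seeing `assumed-role/ec2-app-role/...` on the instance and `user/yourname-admin` on your laptop confirms that the instance is using the role and not copied access keys.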
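Deploy `vps-multiservice` to EC2: clone the repo, create `.env`, `docker compose up -d`. Verify from the instance that `curl http://localhost:5000/health` returns 200. Port 5000 is not open in `web-sg`, so the same request against the public IP will time out unless you go through Nginx on 80/443.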
---
### Week 8 - S3 + VPC
**Learning objectives:** S3 bucket operations, IAM policies for S3, static hosting, lifecycle policies, presigned URLs, boto3; VPC fundamentals: subnets, route tables, internet gateway, security groups vs NACLs
---
**Monday–Tuesday: S3**
```bash
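# Create bucket. Names are globally unique, so the timestamp suffix avoids collisions;
# substitute the resulting name wherever your-backup-bucket appears below.
aws s3 mb s3://your-backup-bucket-$(date +%s) --region ap-south-1
# Upload and download
aws s3 cp backup.tar.gz s3://your-backup-bucket/backups/backup.tar.gz
aws s3 sync ./logs/ s3://your-backup-bucket/logs/
aws s3 ls s3://your-backup-bucket/ --recursive
# Static website hosting
aws s3 mb s3://your-docs-site
aws s3 website s3://your-docs-site --index-document index.html --error-document 404.html
# New buckets block public access and disable ACLs by default, so --acl public-read
# is rejected until you change those settings; a public-read bucket policy is the
# usual alternative.
aws s3 sync ./docs/ s3://your-docs-site --acl public-read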
```
Lifecycle policy (JSON):
```json
{
"Rules": [{
"ID": "archive-old-backups",
"Status": "Enabled",
"Filter": {"Prefix": "backups/"},
"Transitions": [{
"Days": 30,
"StorageClass": "GLACIER"
}],
"Expiration": {"Days": 365}
}]
}
```
```bash
aws s3api put-bucket-lifecycle-configuration \
--bucket your-backup-bucket \
--lifecycle-configuration file://lifecycle.json
```
Presigned URLs with Python (boto3):
```python
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
# Generate URL that expires in 1 hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "your-backup-bucket", "Key": "backups/backup.tar.gz"},
    ExpiresIn=3600,
)
print(url)  # anyone with this URL can download for 1 hour
```
Update `backup.sh` to upload to S3 after creating the local archive:
```bash
# At end of backup.sh:
if command -v aws &>/dev/null; then
    aws s3 cp "$ARCHIVE" "s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")" && \
        echo "Uploaded to S3: s3://${S3_BUCKET}/backups/$(basename "$ARCHIVE")"
fi
```
---
**Wednesday–Thursday: VPC**
The default VPC works for most things. Understanding custom VPCs is the interview signal.
Create a custom VPC:
1. VPC → Create VPC
   - Name: `app-vpc`
   - CIDR: `10.0.0.0/16`
2. Create subnets:
   - Public subnet: `10.0.1.0/24`, AZ: ap-south-1a
   - Private subnet: `10.0.2.0/24`, AZ: ap-south-1b
3. Create internet gateway, attach to `app-vpc`
4. Route table for public subnet: add route `0.0.0.0/0 → internet gateway`
5. Route table for private subnet: no internet route (private)
6. Launch EC2 in public subnet with auto-assign public IP enabled; it gets a public IP and internet access
7. Understand: an EC2 in private subnet has no internet unless you add a NAT gateway (costs ~$0.045/hour; don't provision, just understand)
The interview question version: "What's the difference between a public and private subnet in AWS?" Answer: a public subnet has a route to an internet gateway; a private subnet does not. Resources in a private subnet can reach the internet via a NAT gateway but cannot be reached from the internet directly.
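Before clicking through the console, you can sanity-check the CIDR plan with Python's standard `ipaddress` module. The sketch below verifies that each subnet sits inside the VPC range, that subnets do not overlap, and how many addresses are left after the five AWS reserves in every subnet. The function name `check_layout` is hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def check_layout(vpc_cidr, subnet_cidrs):
    """Return a per-subnet report for a {name: cidr} mapping."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    report = {}
    for name, net in subnets.items():
        report[name] = {
            "inside_vpc": net.subnet_of(vpc),
            "usable_addresses": net.num_addresses - AWS_RESERVED_PER_SUBNET,
            "overlaps": sorted(
                other
                for other, other_net in subnets.items()
                if other != name and net.overlaps(other_net)
            ),
        }
    return report


if __name__ == "__main__":
    layout = check_layout(
        "10.0.0.0/16",
        {"public": "10.0.1.0/24", "private": "10.0.2.0/24"},
    )
    for name, info in layout.items():
        print(name, info)
```

For the plan above, both `/24` subnets fall inside `10.0.0.0/16`, neither overlaps the other, and each offers 251 assignable addresses.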
---
**Friday–Saturday: Integrate S3 into the CI/CD pipeline**
Update `.github/workflows/deploy.yml` in `vps-multiservice`:
```yaml
      # Add as a step in the deploy job, after checkout
      - name: Archive deployed docker-compose.yml to S3
        run: |
          aws s3 cp docker-compose.yml \
            s3://your-backup-bucket/deployments/$(date +%Y%m%d_%H%M%S)/docker-compose.yml
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: ap-south-1
```
This demonstrates integration of S3 into a CI/CD pipeline. The backup is a real operational artifact: it means you can reconstruct what was deployed at any given time.
---
### Week 9 - CloudWatch
**Learning objectives:** CloudWatch Logs (log groups, log streams, log agents), metric filters, alarms, custom metrics via boto3
---
**Monday–Tuesday: CloudWatch Logs**
Install the CloudWatch agent on the EC2 instance:
```bash
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
```
Configure `/opt/aws/amazon-cloudwatch-agent/bin/config.json`:
```json
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "vps-multiservice",
"log_stream_name": "nginx-access"
},
{
"file_path": "/var/log/nginx/error.log",
"log_group_name": "vps-multiservice",
"log_stream_name": "nginx-error"
}
]
}
}
}
}
```
```bash
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
```
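The EC2 IAM role needs CloudWatch permissions. Add a statement like the one below to the role's policy, or attach the AWS managed policy `CloudWatchAgentServerPolicy`. `logs:*` would also work but grants far more than the agent needs, which undercuts the least-privilege habit from Week 7:
```json
{
  "Effect": "Allow",
  "Action": [
    "cloudwatch:PutMetricData",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:DescribeLogGroups",
    "logs:DescribeLogStreams"
  ],
  "Resource": "*"
}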
```
---
**Wednesday–Thursday: Metric filters + alarms**
In the AWS console:
1. CloudWatch → Log groups → `vps-multiservice` → nginx-access
2. Create metric filter: pattern `[ip, id, user, timestamp, request, status_code=5*, ...]`
3. Metric name: `nginx-5xx`, metric value: 1
4. Create alarm on this metric: if sum > 5 in 5 minutes → SNS notification to your email
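To see what the metric filter is doing, it helps to run the same count locally against a few access-log lines. The sketch below assumes nginx's default `combined` log format; the regular expression and the names `count_5xx` and `alarm_would_fire` are illustrative, not anything CloudWatch exposes.

```python
import re

# combined format: ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def count_5xx(lines):
    """Count log lines whose status code starts with 5 (the status_code=5* match)."""
    count = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("status").startswith("5"):
            count += 1
    return count


def alarm_would_fire(lines, threshold=5):
    """Mirror the alarm condition: sum of matches strictly greater than the threshold."""
    return count_5xx(lines) > threshold


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [10/Oct/2025:13:55:36 +0530] "GET /api/items HTTP/1.1" 200 512 "-" "curl/8.5.0"',
        '203.0.113.7 - - [10/Oct/2025:13:55:37 +0530] "GET /api/items HTTP/1.1" 502 157 "-" "curl/8.5.0"',
        '198.51.100.4 - - [10/Oct/2025:13:55:38 +0530] "POST /api/items HTTP/1.1" 503 157 "-" "python-requests/2.32"',
    ]
    print("5xx count:", count_5xx(sample))
    print("alarm fires:", alarm_would_fire(sample))
```

Feeding it a five-minute slice of `/var/log/nginx/access.log` gives a quick cross-check that the filter pattern and the alarm threshold behave the way you expect.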
Billing alarm (if not done in Week 7):
```bash
# Billing metrics exist only in us-east-1, so the alarm and its SNS topic must live there.
# The description is single-quoted so the shell does not expand $5.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "billing-alert-5usd" \
  --alarm-description 'Alert when charges exceed $5' \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --statistic Maximum \
  --period 86400 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=Currency,Value=USD \
  --evaluation-periods 1 \
  --alarm-actions <your-sns-topic-arn>
```
---
**Friday–Saturday: Custom metrics + Python integration**
Add a `--push-to-cloudwatch` flag to `vps-health-reporter.py`:
```python
import boto3


def push_to_cloudwatch(metrics_dict, namespace="VPS/Health"):
    """Publish a {name: value} dict as CloudWatch custom metrics."""
    cw = boto3.client("cloudwatch", region_name="ap-south-1")
    metric_data = []
    for name, value in metrics_dict.items():
        metric_data.append({
            "MetricName": name,
            "Value": float(value),
            # names containing "pct" are sent as percentages, everything else as counts
            "Unit": "Percent" if "pct" in name.lower() else "Count",
        })
    if metric_data:
        cw.put_metric_data(Namespace=namespace, MetricData=metric_data)
        print(f"Pushed {len(metric_data)} metrics to CloudWatch")
```
Run this as a cron job on the EC2 every 5 minutes:
```bash
crontab -e
# Add:
*/5 * * * * /usr/bin/python3 /opt/scripts/vps-health-reporter.py --push-to-cloudwatch
```
In the CloudWatch console, verify the custom metrics appear under the `VPS/Health` namespace.
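For reference, the `metrics_dict` passed to `push_to_cloudwatch` can be as simple as the dictionary built below. This is a hypothetical collector using only the standard library; how `vps-health-reporter.py` actually gathers its numbers is not shown in this section, so treat the function name `collect_metrics` and the metric names as assumptions. The `pct` suffix is chosen so that `push_to_cloudwatch` reports both values with the `Percent` unit.

```python
import os
import shutil


def collect_metrics(path="/"):
    """Return a small {metric_name: value} dict for the host (Unix only)."""
    usage = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()  # 1-minute load average
    cores = os.cpu_count() or 1
    return {
        "disk_used_pct": round(usage.used / usage.total * 100, 1),
        # load average relative to core count; can exceed 100 on an overloaded box
        "load_1m_pct": round(load_1m / cores * 100, 1),
    }


if __name__ == "__main__":
    for name, value in collect_metrics().items():
        print(f"{name}: {value}")
```

Running the collector on its own before wiring it to CloudWatch makes it easy to confirm the values look sane.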
---
### Week 10 - Phase 2 Consolidation
**What this week unlocks:** You have AWS concretely on your CV. This week polishes everything, prepares interview answers for each project, and pushes the application cadence to 10+/week.
**Monday–Wednesday: Documentation and GitHub polish**
`vps-multiservice` README must now include:
- Architecture diagram: internet → Route53 (or direct IP) → Nginx (SSL) → Docker network → Flask/Redis/Postgres
- CI/CD flow: push to GitHub → Actions → SSH deploy to EC2 → smoke test
- AWS resources used: EC2, IAM role, S3 (backups), CloudWatch (logs and metrics)
- How to reproduce: step-by-step from zero to running
`ops-scripts` README: add a section on S3 backup integration and CloudWatch metric publishing.
**Thursday–Friday: Interview preparation (talking about your projects)**
For each project, prepare a 3-minute verbal answer to "tell me about this project":
- FactorSphere: "FactorSphere is a live academic journal ranking platform I built that serves real users. The architecture is fully edge-native: it runs on Cloudflare Workers rather than a traditional server, which means requests are processed at the edge closest to the user. The backend is a set of serverless microservices analogous to AWS Lambda: one for search, one for ranking, one for data processing. I integrated Pinecone as a managed vector database for semantic search and an LLM for query understanding. CI/CD is automated via GitHub Actions and Wrangler. I've documented the full pipeline in the repo."
- vps-multiservice: "This is a multi-service API stack I deployed on a VPS and on EC2. It runs Flask, Redis, and Postgres in Docker containers orchestrated by Docker Compose. I wrote multi-stage Dockerfiles to minimize image size, configured health checks so Compose knows when each service is ready before starting dependents, and set up Nginx as a reverse proxy with SSL termination. The deployment is automated via GitHub Actions: push to main triggers an SSH deploy to the server with a smoke test. Logs ship to CloudWatch and backups go to S3 via an IAM role."
**Saturday–Sunday: Application push**
By Week 10, increase to 10–15 applications/week. You now have AWS on your CV. Apply to roles that were out of reach at Week 4:
- "AWS Cloud Engineer (Junior/Fresher)"
- "Cloud Infrastructure Engineer"
- GCC roles that specify AWS in the JD
---
### Weeks 11–12 - Buffer, Deepening, and Interview Prep
**Week 11:** Take the weakest area (likely whichever of Linux/Docker/AWS you feel least confident explaining verbally) and go deeper. Run through interview scenarios: set up a broken Docker network and diagnose it, break an Nginx config and read the error, terminate an EC2 instance and relaunch from scratch without notes.
**Week 12:** Final project polish, README quality pass on all repos, continue applying. If interviews are happening, focus prep here on the specific company's stack.
---
## PHASE 3: Weeks 13–24 - Become Genuinely Competent
---
### Weeks 13–15 - Terraform
**What this unlocks:** Infrastructure as Code is expected at mid-level and increasingly mentioned in junior JDs. More importantly, after doing Weeks 7–9 by hand in the console and CLI, Terraform will click immediately: you already understand the resources, now you're just defining them declaratively.
**Week 13:** Terraform core concepts: providers, resources, state, plan/apply/destroy. Write Terraform that creates the Week 7/8 infrastructure: EC2, security groups, IAM role, S3 bucket. Run `terraform plan` to see what it would create. Run `terraform apply`. Verify it matches what you built manually.
```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket = "your-terraform-state"
    key    = "vps-multiservice/terraform.tfstate"
    region = "ap-south-1"
  }
}

resource "aws_instance" "app" {
  # Ubuntu 22.04, ap-south-1 (AMI IDs are region-specific and change over time; confirm before applying)
  ami                    = "ami-0287a05f0ef0e9d9a"
  instance_type          = "t2.micro"
  key_name               = aws_key_pair.deploy.key_name
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name
  user_data              = file("bootstrap.sh")
  tags = {
    Name    = "vps-multiservice"
    Project = "portfolio"
  }
}
```
**Week 14:** Terraform state, modules, variables, outputs. Write a module for the EC2 + security group pair so it's reusable. Store state remotely in S3 with DynamoDB locking (use an S3 backend; DynamoDB is free at this scale).
**Week 15:** Terraform for the full Week 7–9 stack: EC2, VPC, subnets, route tables, internet gateway, IAM roles, S3 buckets, CloudWatch log group. The entire infrastructure is now in `infrastructure-iac` repo. `terraform apply` from scratch should produce a working environment.
**Repository:** `infrastructure-iac`: contains Terraform for all AWS infrastructure, README explaining what it creates and why, `.terraform.lock.hcl` committed, state backend configured, variables documented in `variables.tf`
---
### Weeks 16–17 - Prometheus + Grafana
**What this unlocks:** Monitoring stack demonstrates operational maturity beyond deployment. "Have you set up monitoring?" is a common interview question; being able to say yes with a GitHub repo is a strong differentiator at junior level.
**Week 16:** Deploy Prometheus and the Node Exporter on the Hetzner VPS (not EC2, to keep costs at zero here). Prometheus scrapes Node Exporter for system metrics (CPU, memory, disk, network). Scrape the Flask API's `/metrics` endpoint (add `prometheus-flask-exporter` to the app). Prometheus config:
```yaml
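# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  # localhost targets assume Prometheus runs directly on the host. Inside a
  # Compose container, localhost is the Prometheus container itself, so point
  # these at the Flask service name and at an address that reaches the host.
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
  - job_name: flask-api
    static_configs:
      - targets: ['localhost:5000']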
```
Run everything in Docker Compose: add Prometheus and Node Exporter to the existing `docker-compose.yml` in `vps-multiservice`:
```yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus  # declare prometheus_data under the top-level volumes: key
    ports:
      - "9090:9090"
  node_exporter:
    image: prom/node-exporter:latest
    network_mode: host
    pid: "host"
    command:  # point the exporter at the host paths mounted below
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
```
**Week 17:** Deploy Grafana, connect it to Prometheus as a data source, build dashboards:
- System overview: CPU, memory, disk, network I/O
- Flask API: request rate, error rate, latency (p50, p95, p99)
- Alerts: Prometheus alerting rules routed through Alertmanager, emailing when disk > 80% or API error rate > 5%
Export your dashboards as JSON and commit them to the repo. This means the monitoring stack is reproducible.
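The p50/p95/p99 latency panels are estimates, not exact values: Prometheus stores latency as cumulative histogram buckets and interpolates within the bucket that contains the requested rank. The sketch below reproduces that linear interpolation in plain Python as an illustration. It is a simplified model of what PromQL's `histogram_quantile` does, and the bucket boundaries and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The lowest bucket is assumed to start at 0, and the highest bucket is
    expected to have an upper bound of +Inf, as in a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Invented data: 100 requests; 50 took <= 0.1s, 90 took <= 0.5s, all took <= 1s
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    for q in (0.5, 0.95):
        print(f"p{int(q * 100)} is roughly {histogram_quantile(q, latency_buckets):.2f}s")
```

The practical takeaway for dashboards: quantile accuracy depends on bucket boundaries, so choose buckets that bracket the latencies you care about.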
---
### Weeks 18–22 - Kubernetes
Five weeks because the surface area is large and the concepts require time to internalize.
**Week 18:** Kubernetes architecture: what a Node is, what a Pod is, what a Deployment is, how the control plane works (API server, scheduler, controller manager, etcd). Install `k3s` on a local KVM VM (not on the VPS; k3s with multiple services will use too much RAM on CX22):
```bash
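curl -sfL https://get.k3s.io | sh -
# k3s.yaml is readable only by root by default; either run `sudo k3s kubectl ...`
# or copy the file to ~/.kube/config with your own ownership
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes
kubectl get pods -A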
```
Write Kubernetes manifests for the Flask API:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
        - name: flask-api
          image: myapi:latest
          ports:
            - containerPort: 5000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 30
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api
spec:
  selector:
    app: flask-api
  ports:
    - port: 80
      targetPort: 5000
```
**Weeks 19–20:** ConfigMaps, Secrets, Ingress (using k3s's built-in Traefik ingress controller), PersistentVolumes for Postgres. Deploy the full multi-service stack to Kubernetes.
**Weeks 21–22:** Rolling updates, rollbacks (`kubectl rollout undo deployment/flask-api`), Horizontal Pod Autoscaler, namespace isolation, RBAC basics. Write a GitHub Actions workflow that builds a Docker image, pushes to Docker Hub or GitHub Container Registry, and updates the k3s deployment via `kubectl set image`.
**Repository:** `k8s-fundamentals`: all manifests in `manifests/`, Kustomize or Helm intro in Week 22, README explaining architecture decisions
---
### Weeks 23–24 - Consolidation
Audit all repos: every README answers what it runs, how to run it, what decisions were made, and what you'd do differently with more time. This last part ("what I'd do differently") is an advanced interview signal. Examples: "I'd add a proper secrets management solution (Vault or AWS Secrets Manager) instead of .env files", "I'd move Terraform state to a proper remote backend with team locking", "I'd add structured logging so log-derived metrics can be exported to Prometheus."
Write up a "portfolio narrative", a one-page document (not on GitHub, for your own use) that ties everything together: FactorSphere shows edge-native architecture and LLM integration; vps-multiservice shows traditional server ops; the Terraform repo shows IaC; the monitoring stack shows observability; the Kubernetes repo shows container orchestration. This is your answer to "walk me through your projects" in a senior technical round.
---
## END SECTION
---
### 1. GitHub Repositories by Milestone
**Week 4 Repositories**
`ops-scripts`: bash health check, backup, and log analyzer scripts; Python endpoint monitor, deploy check, Cloudflare deploy monitor, config validator, and VPS health reporter. Why a recruiter cares: it demonstrates ability to automate real operational tasks, the most common take-home test format. Scripts that run on real infrastructure (not mock data) are distinguishable immediately.
`vps-multiservice`: Docker Compose stack (Flask + Redis + Postgres), multi-stage Dockerfile, `.env.example`, Makefile with operational targets, systemd service file, and README documenting architecture decisions. Why a recruiter cares: it's a real multi-service production-style stack running on actual infrastructure, not a tutorial clone. The health check endpoint that checks dependency connectivity is a concrete signal of operational thinking.
`factorsphere` (existing, updated): `docs/ARCHITECTURE.md` and `docs/CICD.md` added. Why a recruiter cares: live product with real users transforms this from a class project into a production deployment story. The architecture documentation demonstrates that you understand what you built, not just that you ran commands.
**Week 8 Repositories**
`vps-multiservice` (updated): now includes a GitHub Actions CI/CD workflow deploying to both VPS and EC2, Nginx reverse proxy configuration, CloudWatch log shipping, S3 backup integration, and an updated README with the full AWS architecture. A badge shows the live CI/CD status. Why a recruiter cares: this is a complete junior DevOps portfolio project covering Docker, CI/CD, AWS, monitoring, and backups, the canonical skill set for a junior role.
`ops-scripts` (updated): S3 backup integration in `backup.sh`, CloudWatch metric publishing in `vps-health-reporter.py`, updated README. Why a recruiter cares: scripts that interact with real cloud services (boto3, AWS CLI) demonstrate practical cloud knowledge beyond "I know what S3 is."
**Week 24 Repositories**
`infrastructure-iac`: Terraform modules for the full AWS stack (EC2, VPC, subnets, security groups, IAM, S3). `terraform apply` from a clean account produces the entire `vps-multiservice` environment. Why a recruiter cares: IaC in any junior portfolio is unusual and signals maturity; Terraform specifically is the most common IaC tool in NCR DevOps JDs.
`monitoring-stack`: Prometheus, Grafana, and Alertmanager in Docker Compose, dashboard JSON files committed, Prometheus recording rules, Alertmanager config with email notifications. Why a recruiter cares: monitoring is the answer to "how do you know your service is healthy?", and most junior candidates can't demonstrate this.
`k8s-fundamentals`: Kubernetes manifests for the multi-service stack running on k3s, a GitHub Actions pipeline that builds and deploys to the cluster, RBAC configuration, HPA configuration. Why a recruiter cares: Kubernetes is increasingly mentioned even in junior NCR DevOps JDs; a working cluster with manifests is proof of hands-on experience, not certification prep.
---
### 2. Job Titles and Application Timing
**Apply now: Week 4**
Titles: Junior DevOps Engineer, DevOps Engineer (0-2 years), Cloud Support Engineer, Infrastructure Engineer, Systems Engineer, Associate DevOps Engineer
Platforms: LinkedIn (primary; set alerts), Naukri.com (mandatory; many service companies and GCCs post roles here that never reach other aggregators), Instahyre, Wellfound
Company targets in NCR: Nagarro (Gurgaon), GlobalLogic (Noida), Publicis Sapient (Gurgaon), NIIT Technologies, Mphasis, smaller funded startups from Antler portfolio and other VC-backed companies. For service companies: HCL Technologies and Wipro have DevOps JDs that are genuine infrastructure work (not all of them; filter by the actual JD content, not just the title).
Direct career pages worth bookmarking: nagarro.com/careers, globallogic.com/careers, publicissapient.com/careers, mphasis.com/careers. These pages often have roles that don't appear on aggregators for 1–2 weeks after posting.
**Apply at Week 8**
You now have AWS concretely on your CV. Expand to: AWS Cloud Engineer (Junior), Cloud Infrastructure Engineer, Cloud Operations Engineer, DevOps Engineer roles that specify AWS in the JD.
New targets: GCCs that explicitly require AWS, such as Optum (Noida), Genpact Technology (Gurgaon), EXL Service, WNS, and Concentrix Technology. Apply via their career pages and Naukri simultaneously. Your profile is now meaningfully stronger: end-to-end AWS stack (EC2, IAM, S3, VPC, CloudWatch) with CI/CD and a live project to discuss.
**Do not apply until Week 24**
Platform Engineer: typically requires Kubernetes + Terraform + 2+ years. Platform teams at larger companies in NCR have a higher bar.
Senior DevOps Engineer: even at companies that post "1-3 years experience required", the actual expectation is 2+ years of real DevOps work. Applying earlier wastes your time.
DevSecOps Engineer: security tooling (Vault, Snyk, SAST/DAST pipelines) is a domain requiring deliberate study not on this roadmap.
Cloud Architect: requires 5+ years and cert-level AWS knowledge.
---
### 3. What Gets Tested in NCR DevOps Interviews
**Phone screen / HR call**
You will be asked: your experience summary (have a 90-second version), which tools you've used (answer specifically: not "I know Docker", but "I've written multi-stage Dockerfiles and run Compose stacks on a VPS"), availability and notice period (fresh grad = immediate joining), salary expectation (state your target, not your floor; see Section 5), why DevOps specifically.
They are screening for: articulate communication, genuine experience (not just listed on CV), salary fit, and basic logical thinking. They are not testing technical knowledge here.
**Technical round**
Linux questions that actually appear:
- "What does the first column of `ps aux` output mean?" (process owner)
- "How do you find which process is using port 8080?" (`ss -tlnp | grep 8080` or `lsof -i :8080`)
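- "What's the difference between `kill -9` and `kill -15`?" (SIGKILL vs SIGTERM: SIGKILL cannot be caught and terminates the process immediately; SIGTERM is a request to stop that the process can handle to shut down gracefully)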
- "A service is failing to start. Walk me through your diagnosis." (`systemctl status`, `journalctl -u servicename -n 50`, check the binary exists and permissions are correct, check the port isn't already in use)
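- "What is load average?" (the average number of processes running or waiting to run over the last 1, 5, and 15 minutes; on Linux the count also includes processes in uninterruptible sleep, usually blocked on disk I/O. Read it relative to the number of CPU cores)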
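The "relative to CPU cores" part of the load average answer is easy to show in code. The sketch below is an illustration only: the helper names `load_per_core` and `describe_load` are made up for this example, and `os.getloadavg()` is available on Unix-like systems but not on Windows.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return a load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def describe_load() -> str:
    """Summarise the 1/5/15-minute load averages relative to core count."""
    cores = os.cpu_count() or 1            # cpu_count() may return None
    one, five, fifteen = os.getloadavg()   # Unix only
    parts = []
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        per_core = load_per_core(value, cores)
        parts.append(f"{label}={value:.2f} ({per_core:.2f}/core)")
    return f"{cores} cores: " + ", ".join(parts)
```

A per-core value that stays above 1.0 means there is, on average, more runnable (or I/O-blocked) work than the cores can service, which is the point to make in the interview: a load of 4.0 is fine on 8 cores and a problem on 2.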
Docker questions that actually appear:
- "What's the difference between CMD and ENTRYPOINT in a Dockerfile?" (CMD is replaced by any arguments passed to `docker run`; ENTRYPOINT stays in place unless it is overridden explicitly with `--entrypoint`. When both are set, CMD provides default arguments to ENTRYPOINT)
- "How do containers in a Docker Compose network resolve each other?" (by service name; Docker provides built-in DNS for user-defined networks)
- "What's a multi-stage build and why use it?" (multiple FROM instructions, with only the final stage ending up in the image; this keeps build tools out of the production image and reduces size and attack surface)
- "How do you make a Docker container restart automatically?" (`--restart unless-stopped` or `restart: unless-stopped` in Compose)
Networking questions that actually appear:
- "Walk me through what happens when a user visits a URL" (DNS lookup → TCP connection → TLS handshake → HTTP request → response)
- "What's the difference between a reverse proxy and a load balancer?" (reverse proxy hides the backend; load balancer distributes across multiple backends; Nginx can do both)
- "What does a 502 error mean?" (bad gateway: the proxy received an invalid response from the upstream server)
- "How does SSH authentication work?" (for key-based auth: the client offers a public key; if the server finds it in `authorized_keys`, the client proves it holds the matching private key by signing data tied to the current session, and the server verifies that signature with the stored public key. The private key never leaves the client)
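The stages in the "user visits a URL" answer can be walked through by hand with the standard library, which is a useful way to rehearse the explanation. This is a minimal sketch, assuming an HTTPS site on port 443; the function names `build_request` and `fetch_status_line` are invented for the example, and real clients add redirects, retries, and full response parsing.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_status_line(host: str, path: str = "/", timeout: float = 5.0) -> str:
    """Perform each stage explicitly and return the response status line."""
    # 1. DNS lookup: resolve the hostname to a socket address.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, 443, type=socket.SOCK_STREAM
    )[0]
    # 2. TCP connection to the resolved address.
    with socket.socket(family, socktype, proto) as raw:
        raw.settimeout(timeout)
        raw.connect(sockaddr)
        # 3. TLS handshake, verifying the certificate against `host`.
        context = ssl.create_default_context()
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # 4. HTTP request, then read the beginning of the response.
            tls.sendall(build_request(host, path))
            data = tls.recv(4096)
    # 5. Response: the first line carries the status code.
    return data.split(b"\r\n", 1)[0].decode("iso-8859-1")
```

Calling `fetch_status_line("example.com")` against a reachable host returns the first line of the server's reply. Each numbered comment maps to one arrow in the answer above, so a failure at a given step tells you which layer to debug.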
Scripting:
- You may be asked to write a Bash or Python script live. Common formats: "write a script that checks if these services are running and restarts any that aren't", "write a function that parses this log file and counts occurrences of each status code"
- The evaluator is checking: do you use proper error handling (`set -euo pipefail` in bash, `try/except` in Python), do you use functions, do you write readable code
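As a reference point for the log-parsing prompt, here is one possible shape for an answer. It is a sketch under an assumption that the interviewer may not share: the log is taken to be in the common/combined access-log layout, where the status code directly follows the quoted request line. The names `STATUS_RE`, `count_status_codes`, and `count_status_codes_in_file` are illustrative.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout: ... "GET /path HTTP/1.1" 200 1234
# The status code is the three digits right after the closing quote.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_codes(lines: Iterable[str]) -> Counter:
    """Count occurrences of each HTTP status code, skipping malformed lines."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_status_codes_in_file(path: str) -> Counter:
    """Read a log file line by line so large files are not loaded at once."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return count_status_codes(handle)
    except OSError as exc:
        # Fail with a readable message instead of a traceback.
        raise SystemExit(f"cannot read {path}: {exc}")
```

Splitting the pure counting function from the file handling covers the points evaluators look for: functions, explicit error handling, and code that is easy to test with a few sample lines.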
**Practical / take-home task**
Common formats (2–4 hours):
- "Write a Dockerfile for this Python app, a Docker Compose file that adds Redis, and a health check endpoint"; you'll have a GitHub repo to fork
- "Write a GitHub Actions workflow that runs tests and deploys to a remote server on push to main"; they provide fake SSH credentials or ask you to describe what the secrets would contain
- "Write a Python script that monitors these endpoints and emails you if any return non-200"; this tests `requests`, SMTP, and argparse
- "Debug this broken docker-compose.yml"; they give you a file with 3-5 intentional errors
What separates good submissions: health checks are included, `.env.example` is present, README explains what it does and how to run it, commit history shows incremental work (not one giant commit), you handle error cases explicitly.
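The endpoint-monitor task from the list above could be approached along these lines. This is a hedged sketch, not a model answer: it swaps `requests` for the standard library's `urllib.request` so it runs without extra packages, it assumes an SMTP relay that accepts unauthenticated mail on the given host, and the function names and CLI flags are invented for the example.

```python
import argparse
import smtplib
import urllib.error
import urllib.request
from email.message import EmailMessage


def check(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code   # the server answered with a 4xx/5xx
    except (urllib.error.URLError, OSError):
        return 0          # DNS failure, refused connection, or timeout


def failures(results: dict[str, int]) -> dict[str, int]:
    """Keep only the endpoints that did not return 200."""
    return {url: code for url, code in results.items() if code != 200}


def build_alert(failed: dict[str, int], sender: str, recipient: str) -> EmailMessage:
    """Compose the alert email listing each failing endpoint."""
    message = EmailMessage()
    message["Subject"] = f"{len(failed)} endpoint(s) failing"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        "\n".join(f"{url} -> {code}" for url, code in failed.items())
    )
    return message


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Alert on non-200 endpoints")
    parser.add_argument("urls", nargs="+")
    parser.add_argument("--smtp-host", default="localhost")
    parser.add_argument("--sender", required=True)
    parser.add_argument("--recipient", required=True)
    args = parser.parse_args(argv)

    failed = failures({url: check(url) for url in args.urls})
    if failed:
        with smtplib.SMTP(args.smtp_host) as smtp:
            smtp.send_message(build_alert(failed, args.sender, args.recipient))
    # A non-zero exit code lets cron or CI notice the failure too.
    return 1 if failed else 0
```

To run it as a script, add `raise SystemExit(main())` under an `if __name__ == "__main__":` guard. In a real submission, SMTP credentials would come from environment variables documented in `.env.example`, never from the source file.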
---
### 4. Common Mistakes That Prevent Freshers From Getting DevOps Jobs in India
**Listing technologies you can't defend.** If "Kubernetes" is on your CV because you ran `kubectl get pods` once in a tutorial, an interviewer asking "how does a rolling update work?" will end the conversation. Every technology on your CV must have a project behind it and a clear verbal answer to "tell me how you used this."
**GitHub repos without READMEs.** A recruiter spending 30 seconds on a repo with no README closes the tab. A repo called `docker-practice` with three files and no explanation tells a hiring manager nothing about your ability to operate infrastructure. Every repo needs to explain what it runs, why it exists, and what decisions were made.
**Applying to renamed helpdesk roles.** Many "DevOps Engineer" JDs at service companies are L1 support with a rebranded title. Read the JD carefully. Red flags include "ITIL certification preferred", "ticketing system experience required", "incident management", and no mention of Docker/AWS/Linux in the technical requirements. These roles do not build the skills you need and often pay below the floor you've set.
**Not knowing your own project in depth.** "I deployed a multi-service Docker Compose stack" is not an answer. "I deployed a Flask API with Redis and Postgres in Docker Compose, using multi-stage builds to reduce image size from 1.1GB to 180MB, with health checks so Postgres is confirmed ready before the API starts, running behind Nginx with SSL termination, and automated via GitHub Actions" is an answer. Practice this verbally, not just in your head.
**Treating salary floor as the opening number.** If you say "I'm looking for around ₹35,000" at a startup that would pay ₹55,000 for your profile, you've lost ₹20,000/month permanently. Know the market rates (see Section 5) and start at your target ceiling for each company tier, not your floor.
**One-commit GitHub histories.** A repo where all the work appears in a single commit titled "add project files" signals that you copied files over rather than built incrementally. Commit as you work. The commit history is evidence of your process.
**Not applying early enough.** The job search takes time independent of your preparation level. Candidates who start applying at Week 4 get their first offers at Week 8–12. Candidates who wait until they feel "ready" start at Week 8 and get their first offers at Week 14–18. The feedback from real rejections improves your interview performance faster than another week of studying.
**Inconsistent positioning.** Applying to a DevOps role and then mentioning in the interview that you also do React or full-stack work signals that you don't actually want a DevOps role. Decide on the role type and be consistent in every touch point: resume, LinkedIn, outreach, interview answers.
**Vague answers to "tell me about your experience."** "I worked at an Antler-backed company on various projects" is vague. "I was a software developer intern at a venture studio backed by Antler, where I shipped production SaaS MVPs across multiple projects. I owned Docker configuration and CI/CD pipelines across three projects, integrated with external APIs, and worked with distributed teams across time zones" is not vague.
**Sending generic outreach.** A LinkedIn message that could have been sent to anyone ("I am a passionate DevOps professional seeking opportunities") gets ignored. One that references the company's actual stack or a specific JD they posted gets responses. Take 3 minutes per message to make it specific.
---
### 5. Realistic Salary Ranges
**NCR service companies (TCS, HCL, Wipro, Infosys, Tech Mahindra)**
Range: ₹3.5–5 LPA (CTC). Take-home is ~70–75% of CTC after PF, tax, insurance deductions. At ₹4.5 LPA CTC, take-home is approximately ₹28,000–31,000/month.
Your profile puts you toward the upper end of the fresher band, but service companies have fixed fresher slabs that don't move much for individual profile quality. The internship experience may place you in the "experienced fresher" band at some companies (₹4.5–5.5 LPA) versus the "campus fresher" band (₹3.5–4 LPA).
Honest assessment: these roles are the right floor for negotiation, not the target. The actual work at L1/L2 entry in service companies is often not genuine infrastructure. The learning environment is slower. Treat these as the fallback, not the goal.
Skills that move you up within this band: AWS certification (not worth getting for this tier, though), ITIL knowledge (not worth learning for this tier either). Don't optimize for this band.
**NCR funded startups (Series A–C, Antler portfolio)**
Range: ₹5–8 LPA for a genuine junior DevOps role. At ₹6 LPA, take-home is approximately ₹42,000–45,000/month depending on structure.
Your Antler studio connection is a direct advantage here. Antler portfolio companies know what their studio interns produce, so the signal is stronger than a cold application. FactorSphere as a live production product with real users is unusual for a fresher; most candidates at this stage have tutorial clones.
Your realistic target at a well-funded startup: ₹6–7 LPA at Week 4, negotiable to ₹7–8 LPA after Week 8 with AWS on the CV.
Skills that move you up this band: AWS working knowledge (Week 8), ability to own deployment pipelines end-to-end without supervision, Docker and Compose at production level (Week 4), Python scripting that actually runs in their infrastructure.
**Product companies and GCCs (Nagarro, GlobalLogic, Publicis Sapient, Optum, Genpact Tech, EXL)**
Range: ₹6–10 LPA for cloud/DevOps roles. At ₹8 LPA, take-home is approximately ₹55,000–60,000/month.
GCCs pay more than domestic companies for equivalent work because they're paying against a global compensation benchmark. The tradeoff is a higher technical bar at screening and more structured interview processes.
Your profile is competitive here after Week 8 (AWS added). Before Week 8, your serverless and Cloudflare experience is harder to map to what GCC interviewers are looking for; after Week 8, the AWS + Docker + CI/CD story is a clean match.
Realistic target at a GCC: ₹7–9 LPA at Week 8, with AWS and a demonstrated CI/CD project.
Skills that move you up this band: specific AWS service depth (beyond EC2/S3: RDS, ECS, ECR, CodePipeline), monitoring/observability (Prometheus or CloudWatch), IaC awareness (Terraform in Phase 3).
**Your Week 4 realistic ceiling (before AWS):** ₹6–7 LPA at a funded startup or mid-size product company where your FactorSphere + Docker + CI/CD story lands well.
**Your Week 8 realistic ceiling (after AWS):** ₹8–9 LPA at a GCC or strong product company.
**Your floor (non-negotiable):** ₹35,000/month = ~₹4.2 LPA. Achievable at service companies. Don't accept less; it's below market even at service companies for a profile with live production experience.
**Negotiation:** When a recruiter asks for your expectation, say the target number for that company tier, not your floor. "I'm looking for ₹6–7 LPA, based on my production experience and the market for junior DevOps roles in NCR." If they push back, ask what the band is before accepting or declining.
---
### 6. Honest Assessment of the Week 4 Target
Week 4 interview-readiness for junior DevOps roles at startups and mid-size product companies in NCR is realistic. This is not inflated encouragement; it's based on what your profile actually produces by Week 4:
You arrive at Week 4 with: Python competency, Linux daily driver experience, Docker from internship, Git fluency, a live production product with real users, and production CI/CD experience. These are not nothing. Most fresher DevOps candidates have none of the internship context and only classroom exposure.
By Week 4 you add: ops-level Linux (systemd, SSH hardening, process management, log analysis), practical networking (DNS, HTTP, SSH tunneling, firewalls), DevOps Python scripting (subprocess, requests, YAML/JSON, proper CLIs), and production Docker + Compose (multi-stage builds, health checks, named volumes, systemd-managed stacks).
This is enough to pass a phone screen and a standard junior technical round at a startup or mid-tier product company. It is not enough to pass a rigorous GCC technical screen that probes AWS depth; that comes after Week 8.
**Most likely blocker:** The gap between knowing how something works and being able to explain it under mild interview pressure. Docker networking questions specifically ("a container in service A can't reach service B, what do you check?") require not just knowing the answer but being comfortable walking through it out loud. This gap closes with deliberate verbal practice, not with more studying. Spend 30 minutes every day of Week 4 talking through your projects as if you're in an interview. Record yourself if you can; the first playback will tell you exactly what to fix.
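One way to rehearse the "service A can't reach service B" answer is to turn the checks into a small script you run from inside container A. The sketch below is an illustration under assumptions: the function name `diagnose` and its return labels are invented here, and it only separates name resolution from TCP reachability, which are the first two things to rule out.

```python
import socket


def diagnose(host: str, port: int, timeout: float = 2.0) -> str:
    """Return 'dns', 'tcp', or 'ok' for the first stage that fails."""
    try:
        # On a Compose network, service names resolve through Docker's DNS.
        address = socket.gethostbyname(host)
    except socket.gaierror:
        # Name does not resolve: wrong service name, or the two containers
        # are not attached to the same user-defined network.
        return "dns"
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return "ok"
    except OSError:
        # Name resolves but nothing accepts the connection: service not
        # running, wrong port, or the process is bound to 127.0.0.1 only.
        return "tcp"
```

Talking through each branch out loud (what "dns" implies, what "tcp" implies, what to inspect next) is the verbal practice the paragraph above recommends.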
**Second most likely blocker:** Sparse GitHub repos that don't match your verbal claims. If you say "I built a multi-service Docker Compose stack on a VPS" and the recruiter opens the repo to find three files and no README, you've undermined your own story. Prioritize clean, documented repos with meaningful commit history over adding more features.
**Week 6–8 for first offer:** Achievable if you send 10+ applications per week starting Week 4 and follow up on them actively. Realistic for the right startup or product company.
**Week 8–10 for first offer:** The more likely outcome for most candidates, accounting for interview scheduling delays (common in India), HR processes, and the normal distribution of fit between your profile and open roles. This is not a failure case; it's the median outcome for a candidate with your profile executing this plan.
**Week 6–8 for GCC offer:** Unlikely. GCC processes in NCR typically take 4–6 weeks from application to offer even when you pass every round. Apply to GCCs at Week 8 and expect the offer to come at Week 12–14.
The 4-week interview-readiness target is sound. The 6–8 week first offer target is optimistic but achievable. The 8–10 week target is realistic without being pessimistic. The answer to "should I start applying at Week 4 even if I don't feel ready?" is yes, unambiguously.