Job Type: Full Time
Locations: Atlanta
Employment Type: Full-time
Experience: 10+ Years
Required Skills:
Technical Skills
- Elastic Expert : 5+ years of production experience with the Elastic Stack (Elasticsearch, Kibana, Logstash, Beats).
- Kubernetes Mastery : 3+ years managing Elastic Cloud on Kubernetes (ECK) or similar operators on enterprise K8s distributions (Anthos, GKE, or EKS).
- Ingest & Pipelines : Deep knowledge of log ingest architectures, index templates, sharding strategies, and cluster tuning.
Leadership & Experience
- Onsite Leadership : Proven ability to run day-to-day operations, manage technical rosters, and lead cross-functional troubleshooting sessions.
- Enterprise Scale : Experience supporting large-scale platforms (e.g., ETL jobs, microservices, anomaly detection) in a complex corporate environment.
- Incident Management : Familiarity with SRE practices, including incident response (P1-P3), MTTR tracking, and root cause analysis.
Preferred Skills
- Experience with Infrastructure-as-Code (Terraform, Helm) and CI/CD pipelines (Jenkins, GitLab).
- Knowledge of migration strategies between legacy logging platforms (e.g., Splunk) and Elastic.
Responsibilities:
1. Platform Architecture & Onsite Leadership
- Onsite Operational Lead : Act as the primary point of contact for platform stability. Manage daily stand-ups, prioritize the engineering backlog, and coordinate between offshore and onsite teams.
- Architecture & Reliability : Own the Elastic Stack (Elasticsearch, Kibana, and ingest components) deployed on Elastic Cloud on Kubernetes (ECK).
- Capacity Planning : Design cluster topology (nodes, roles, zones) and manage resource quotas (CPU, Heap, Disk) to ensure cost-efficiency and performance.
- SLO Management : Define and track Service Level Objectives (SLOs) for ingestion latency, search availability, and data retention.
2. Logging Strategy & Data Modeling
- Standardization : Define enterprise logging and index templates, including field conventions (service, environment, tenant, correlation IDs) to ensure reliable event correlation across the observability stack.
- Schema Design : Work with application teams to implement standardized mappings and Index Lifecycle Management (ILM) policies (Hot/Warm/Cold tiers, rollover, and retention).
- Data Quality : Own ingest patterns using Filebeat, Fluent Bit, Logstash, or Elastic Agent. Design parsing pipelines (JSON, Grok), enrichment logic, and dead-letter queue strategies.
3. Kubernetes & Infrastructure Operations
- Cluster Management : Oversee daily health across Kubernetes-based clusters (e.g., Anthos/GKE). Resolve pod-level issues such as CrashLoopBackOffs, memory spikes, and disk usage alerts.
- Storage Operations : Lead the management and migration of Persistent Volume Claims (PVCs) for stateful sets, ensuring high availability during infrastructure upgrades.
- Security & Governance : Implement RBAC for indices and Kibana spaces. Enforce data governance and retention policies based on data classification (Infra logs vs. App logs vs. Sensitive data).
4. Observability & Enablement
- Self-Service Enablement : Deliver Kibana dashboards and “Golden Queries” to support SRE and NOC teams in rapid incident triage.
- Documentation & Runbooks : Author and maintain operational runbooks, disaster recovery scenarios, and “Self-Help” guides for onboarding new log sources.
- Mentorship : Provide technical guidance and training sessions for platform engineers and application developers on effective search and logging practices.