Job Title: Senior Software Engineer, AI Research Clusters
Location: Remote
Employment Type: Full-time
Role Overview
We are seeking a Senior Software Engineer to design, build, and optimize large-scale AI research clusters. This role focuses on distributed systems, high-performance computing, and infrastructure that supports AI/ML workloads such as model training and experimentation.
Key Responsibilities
Required Qualifications
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Strong programming skills in Python, Go, C++, or a similar language.
Experience with distributed systems and parallel computing.
Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
Understanding of networking, storage systems, and system performance tuning.
Preferred Skills
Experience with ML frameworks (TensorFlow, PyTorch).
Familiarity with GPU computing (CUDA, NCCL).
Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
Experience with big data tools (Spark, Ray).
Exposure to MLOps and experiment tracking tools.
Key Competencies
Strong problem-solving and systems thinking
Collaboration with research and engineering teams
Performance optimization mindset
Ownership and accountability
For applications and inquiries, contact: hirings@openkyber.com