Job Description
            
Microsoft Azure is the fastest-growing business in Microsoft’s history and serves as the foundation of Microsoft’s commercial cloud services.
Our team, Azure Core, builds and manages the core platform that supports a wide range of services.
As a Principal Software Engineer, you will have an exciting opportunity to innovate and shape the future of computing, and we encourage you to apply and learn more.
  
Our team thrives on collaboration.
You will work alongside a diverse group of professionals who welcome challenges, value continuous learning, and flourish in a cooperative environment.
We embrace inclusivity and diverse perspectives, using empathy, trust, and accountability to drive our culture and deliver solutions in an iterative manner.
You will be part of a fast-paced environment, solving complex problems that require creativity and teamwork to achieve meaningful business outcomes.
We continuously strive for engineering and operational excellence.
  
Azure Core is building the foundation for Microsoft’s cloud services, focusing on infrastructure and advanced cloud platform technologies such as cloud-native applications, containerization (Kubernetes), site reliability engineering (SRE), and high-performance computing (HPC).
We are developing next-generation artificial intelligence (AI) data centers to power large-scale training and inference.
We seek experienced Principal Software Engineers who can design, bootstrap, and operate infrastructure at hyperscale.
  
We are hiring a highly motivated Principal Software Engineer who is passionate about Linux, Kubernetes, and AI infrastructure, embraces open-source software (OSS), and is willing to navigate ambiguity.
  
Microsoft’s mission is to empower every person and every organization on the planet to achieve more.
As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals.
Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
  
**Responsibilities**
  
+ Partners with appropriate stakeholders to determine user requirements for one or more complex scenarios.
+ Provides technical leadership for the identification of dependencies and the development of design documents for a product, application, service, or platform.
+ Leads by example and mentors others to produce extensible and maintainable code used across the company.
+ Leverages deep subject-matter expertise in cross-product features with appropriate stakeholders (e.g., project managers) to lead project plans, release plans, and work items for multiple products.
+ Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
+ Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products, while driving consistency in monitoring and operations at scale and sharing knowledge with other engineers.
  
**Qualifications**
  
**Required Qualifications:**
  
+ Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
+ OR equivalent experience.
+ 1+ year(s) experience bootstrapping and managing data center (DC) infrastructure, including device inventory, diagnosis, and repairs.
+ Bare-metal provisioning using Preboot Execution Environment (PXE), iPXE, Redfish, Open Baseboard Management Controller (OpenBMC), Intelligent Platform Management Interface (IPMI), and Simple Network Management Protocol (SNMP).
+ Networking and security expertise in high-performance networking technologies such as NVIDIA Collective Communications Library (NCCL), NVIDIA Internode Memory Exchange (IMEX), Remote Direct Memory Access (RDMA) over InfiniBand or RDMA over Converged Ethernet version 2 (RoCEv2), and Extended Berkeley Packet Filter (eBPF).
+ Driver and firmware lifecycle management, including Graphics Processing Unit (GPU) diagnostics.
+ 1+ year(s) experience with storage and acceleration technologies for Artificial Intelligence (AI) workloads, including distributed storage systems for multi-exabyte AI workloads, high-throughput data pipelines, preprocessing, and dataset versioning.
+ Data Processing Unit (DPU) acceleration.
+ Linux and GPU internals expertise for performance optimization and troubleshooting.
+ Kernel-level performance tuning, including Non-Uniform Memory Access (NUMA) awareness, Interrupt Request (IRQ) balancing, and control group (cgroup) tuning for GPUs.
  
**Other Requirements:**
  
+ Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
+ Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
  
**Preferred Qualifications:**
  
+ Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
+ OR Master's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
+ OR equivalent experience.
+ 1+ year(s) experience with Artificial Intelligence (AI) and Machine Learning (ML) job scheduling and orchestration at scale, using technologies such as Simple Linux Utility for Resource Management (SLURM), Ray, and Kueue.
+ Model training optimization for performance and scalability.
+ 1+ year(s) experience improving model serving and inference efficiency, ensuring low latency and high throughput for production workloads.
  
Software Engineering IC6 - The typical base pay range for this role across the U.S. is USD $163,000 - $296,400 per year.
There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $220,800 - $331,200 per year.
  
Certain roles may be eligible for benefits and other compensation.
Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
  
Microsoft will accept applications and process offers for these roles on an ongoing basis.
  
#azurecorejobs
  
Microsoft is an equal opportunity employer.
Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances.
If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).