Reliability, Availability and Serviceability Expert, Datacenter AI Products Development at NVIDIA, Santa Clara
For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research.
Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain and interact. There has never been a more exciting time to join us and take visual computing and AI into their next chapter.
We are looking for a product development engineer to serve as subject-matter expert (SME) and drive key aspects of RAS/Resilience features, from chip to module to server, for our next-generation products for AI applications.
We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of large AI training or HPC cluster systems with InfiniBand or enhanced Ethernet.
What you’ll be doing:
Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features, ensuring good test coverage and successful production ramp-ups.
Own the AI system RAS/Resilience models, benchmarking, and risk assessment.
Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.
Drive end-to-end RAS efforts across chip, board, and system to reduce FIT (failures in time) rates.
Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.
Work closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.
Be ready to be challenged to assess new hardware features and to architect manufacturing RAS tests, flows, and methodologies.
You'll nurture a deep understanding of NVIDIA's AI hardware and software architecture.
What we need to see:
BS or higher in EE, CE, CS, Mathematics, or equivalent experience.
12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features of large compute, AI, or HPC systems.
Proficient in compute system RAS/Resilience model theory and methodology.
Proficient in HPC or AI system architecture and cluster interconnect technologies.
Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.
Strong problem-solving and troubleshooting expertise, and experience institutionalizing root-cause analysis.
Initiative, strong interpersonal skills, and the flexibility to adapt to new technologies.
Solid knowledge of or experience in HPC or MLPerf benchmarking is a plus.
NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us.
If you're creative and autonomous, we want to hear from you!
You will also be eligible for equity.
NVIDIA accepts applications on an ongoing basis.
Job market trends (Expertini graph: bar chart of job counts with a trend line over time): 3118 Reliability Availability jobs in the United States, 56 in Santa Clara.