21,610 AI Infrastructure jobs in the United States
AI Infrastructure Engineer
Posted 3 days ago
Job Description
The Role:
Spellbrush, the world’s leading generative AI studio behind niji・journey, is looking for an AI Infrastructure Engineer to join us in building out end-to-end ML infrastructure to run our models on all platforms.
What you’ll do:
- Design, implement, and run our next-generation inference architecture for all the models powering our platforms and applications (mobile, web, etc.).
- Work alongside a fast-paced and nimble team developing state-of-the-art image models serving over 16 million users.
You might be a great fit if:
- You have experience with large distributed systems.
You have familiarity with the latest hotness like K8S, Kafka, NATS, Redis, etc. You’ve cut your teeth on both on-prem and multi-cloud clusters. But most importantly, you deeply understand the tradeoffs and the failure modes of each system you introduce (and potentially even have the battle scars to prove it!).
- You have an excellent understanding of GPUs handling large workloads.
GPU workloads differ from traditional CPU workloads in very interesting ways. Experience deploying, or even optimizing them end-to-end, is a huge plus for this role.
- The anime aesthetic resonates with you.
It's no secret – we're huge anime enthusiasts, and our work focuses on the anime aesthetic. Your work will enable millions of users to partake in an evolving creative movement.
- You're comfortable working on small, fast-paced teams.
We currently have a small tight-knit team on AI. You'll be working closely alongside some of the best AI researchers in the world, on the literal best image model in the world.
We also believe in the unmatched speed of in-person teams, and prefer on-site collaboration in either our primary research office in Tokyo (downtown Akihabara), or San Francisco. Visa sponsorships are available.
The final base salary is dependent upon location, experience, fit, and other factors. In addition, we offer a generous compensation package that includes equity, top-tier employer-sponsored health, dental, and vision insurance, and additional perks!
At Spellbrush, we value creativity, collaboration, and innovation. If you're excited about working with cutting-edge technology and passionate about anime, gaming, and generative AI, we’d love to hear from you!
To apply - please share your previous work experience/resume, Github, or portfolio and the name of the best waifu or husbando in your message!
AI Infrastructure Engineer
Posted 6 days ago
Job Description
DRW is a diversified trading firm with over 3 decades of experience bringing sophisticated technology and exceptional people together to operate in markets around the world. We value autonomy and the ability to quickly pivot to capture opportunities, so we operate using our own capital and trading at our own risk.
Headquartered in Chicago with offices throughout the U.S., Canada, Europe, and Asia, we trade a variety of asset classes including Fixed Income, ETFs, Equities, FX, Commodities and Energy across all major global markets. We have also leveraged our expertise and technology to expand into three non-traditional strategies: real estate, venture capital and cryptoassets.
We operate with respect, curiosity and open minds. The people who thrive here share our belief that it's not just what we do that matters; it's how we do it. DRW is a place of high expectations, integrity, innovation and a willingness to challenge consensus.
As an AI Infrastructure Engineer at DRW, you will be an integral member of a collaborative research team solving the financial markets using machine learning. You'll work on high-impact machine learning (ML) and artificial intelligence (AI) projects central to our core business. In this role, you will build, maintain, and optimize training and inference infrastructure that supports researchers building AI models for financial markets, and discover innovative approaches to challenging data and machine learning problems.
Key Responsibilities:
- Drive end-to-end development of data and AI infrastructure: from initial proof-of-concept to production deployment and ongoing maintenance.
- Provide technical leadership in selecting, integrating, and optimizing AI and ML frameworks, libraries, and tools across diverse hardware and software environments.
- Maintain and optimize the training infrastructure stack, including data pipelines, GPU utilization, monitoring, and observability.
- Proactively troubleshoot performance bottlenecks, conduct root-cause analyses, and implement solutions to optimize GPU or CPU resource usage for both training and inference.
- Design and implement strategies for efficient data movement between storage and GPUs, ensuring high throughput and low latency.
- Develop and maintain high-performance data loading and preprocessing pipelines that maximize GPU utilization.
- Optimize data access patterns and memory management to improve the efficiency of large dataset processing.
- Architect solutions for handling vast volumes of data, ensuring scalability and performance.
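The data-movement responsibilities above come down to one pattern: keep the GPU fed by overlapping host-side preprocessing with device compute. A minimal, framework-free sketch of that producer/consumer overlap (all names are illustrative; a production stack would use a tool like DALI or a framework data loader rather than this):

```python
import queue
import threading

def preprocess(batch):
    # Stand-in for the CPU-side decode/augment work done per batch.
    return [x * 2 for x in batch]

def prefetching_loader(batches, depth=4):
    """Yield preprocessed batches while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=depth)  # bounded queue caps host memory use
    SENTINEL = object()

    def producer():
        for batch in batches:
            q.put(preprocess(batch))  # blocks when the prefetch queue is full
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not SENTINEL:
        yield item

batches = [[1, 2], [3, 4], [5, 6]]
out = list(prefetching_loader(batches))
print(out)  # [[2, 4], [6, 8], [10, 12]]
```

In a real training loop the consumer would be a GPU step and the producer would also copy batches into pinned memory, so host preprocessing and device compute run concurrently instead of serially.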
Qualifications:
- 3+ years of demonstrated experience optimizing data movement and processing for GPU-based systems.
- Expertise in GPU memory management and data transfer optimization.
- Experience with GPU-accelerated libraries like RAPIDS.
- Skills in developing high-performance data loading and preprocessing pipelines with tools like DALI.
- Skills in profiling and optimizing GPU code using tools like NVIDIA Nsight and nvprof.
- Knowledge of distributed computing frameworks and multi-GPU setups.
- Knowledge of distributed training frameworks like DeepSpeed. Prior experience in scaling neural network training and multi-GPU experiments is preferred.
- Some proficiency in CUDA/Triton programming and CUDA kernels optimization is preferred.
- Proficient in problem-solving and analytical reasoning.
- Exceptional communication and collaboration skills.
The annual base salary range for this position is $130,000 to $200,000, depending on the candidate's experience, qualifications, and relevant skill set. The position is also eligible for an annual discretionary bonus. In addition, DRW offers a comprehensive suite of employee benefits including group medical, pharmacy, dental and vision insurance, 401k (with discretionary employer match), short and long-term disability, life and AD&D insurance, health savings accounts, and flexible spending accounts.
For more information about DRW's processing activities and our use of job applicants' data, please view our Privacy Notice at .
AI Infrastructure Engineer
Posted 14 days ago
Job Description
At HeyGen, our mission is to make visual storytelling accessible to all. Over the last decade, visual content has become the preferred method of information creation, consumption, and retention. But the ability to create such content, in particular videos, continues to be costly and challenging to scale. Our ambition is to build technology that equips more people with the power to reach, captivate, and inspire audiences.
Visit our Mission and Culture doc to learn more.
About HeyGen
HeyGen stands at the forefront of cutting-edge AI-powered platforms, revolutionizing the realm of video creation.
Position Summary:
At HeyGen, we are at the forefront of developing applications powered by our cutting-edge AI research. As an AI Infrastructure Engineer, you will lead the development of fundamental AI systems and infrastructure. These systems are essential for powering our innovative applications, including Photo Avatar, Instant Avatar, Streaming Avatar, and Video Translation. Your role will be crucial in enhancing the efficiency and scalability of these systems, which are vital to HeyGen's success.
Key Responsibilities:
- Design, build, and maintain the AI infrastructure and systems needed to support our AI applications. Examples include:
- AI workflow scheduling system to improve GPU efficiency and throughput of our batch inference systems
- Model optimization to improve inference performance
- Auto Train systems to power our avatar models
- Large scale model evaluation systems
- Online model serving systems
- Collaborate with data scientists and machine learning engineers to understand their computational and data needs and provide efficient solutions.
- Stay up-to-date with the latest industry trends in AI infrastructure technologies and advocate for best practices and continuous improvement.
- Assist in budget planning and management of cloud resources and other infrastructure expenses.
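The scheduling work behind the first responsibility can be pictured with a toy greedy policy that always places the next batch job on the least-loaded GPU (job names and the cost model are invented for illustration; this is a sketch, not HeyGen's actual scheduler):

```python
import heapq

def schedule(jobs, num_gpus):
    """Assign each job (a cost in GPU-seconds) to the currently least-loaded GPU.

    Returns per-GPU assignments and the makespan (time until all GPUs finish).
    """
    heap = [(0.0, g) for g in range(num_gpus)]  # (accumulated load, gpu id)
    assignments = {g: [] for g in range(num_gpus)}
    for job, cost in jobs:
        load, g = heapq.heappop(heap)   # least-loaded GPU so far
        assignments[g].append(job)
        heapq.heappush(heap, (load + cost, g))
    makespan = max(load for load, _ in heap)
    return assignments, makespan

jobs = [("avatar-a", 4.0), ("avatar-b", 3.0), ("translate-c", 2.0), ("eval-d", 1.0)]
assignments, makespan = schedule(jobs, num_gpus=2)
print(makespan)  # 5.0
```

Real batch-inference schedulers add preemption, priorities, and memory constraints, but the greedy least-loaded heuristic above is a common baseline for improving GPU utilization.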
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 5+ years of experience
- Proven experience in managing infrastructure for large-scale AI or machine learning projects
- Excellent problem-solving skills and the ability to work independently or as part of a team.
- Proficiency in Python and C++
- Experience with GPU computing and optimizing computational workflows
- Familiarity with AI and machine learning frameworks like TensorFlow or PyTorch.
- Experience with CUDA
- Experience optimizing large deep learning model performance
- Experience building large scale batch inference system
- Prior experience in a startup or fast-paced tech environment.
Benefits:
- Competitive salary and benefits package.
- Dynamic and inclusive work environment.
- Opportunities for professional growth and advancement.
- Collaborative culture that values innovation and creativity.
- Access to the latest technologies and tools.
HeyGen is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
AI Infrastructure Engineer, Agents
Posted 19 days ago
Job Description
As a Software Engineer on the ML Infrastructure team, you will design and build our agent sandboxing platform: the secure, high-performance code execution layer powering our agentic workflows. This system underpins critical applications and research initiatives, and is deployed across both internal and customer-managed environments.
This position requires deep expertise in systems engineering: operating systems, virtualization, networking, containers, and performance optimization. Your work will directly enable agents to execute untrusted or user-submitted code safely, efficiently, and repeatably, with fast startup times, strong isolation guarantees, and support for snapshotting and inspection.
You will:
- Design and build the sandboxing platform for code execution across containerized and virtualized environments.
- Ensure strong isolation, security, and reproducibility of execution across user sessions and workloads.
- Optimize for cold-start latency, memory footprint, and resource utilization at scale.
- Collaborate across security, infra, and product teams to support both internal research use cases and enterprise customer deployments.
- Lead architecture reviews and own projects from design through deployment in fast-paced, cross-functional settings.
Qualifications:
- 3+ years of experience building high-performance systems software (e.g. OS, container runtime, VMM, networking stack).
- Deep understanding of Linux internals, process isolation, memory management, cgroups, namespaces, etc.
- Experience with containerization and virtualization technologies (e.g., Docker, Firecracker, gVisor, QEMU, Kata Containers).
- Proficiency in a systems programming language such as Go, Rust, or C/C++.
- Familiarity with networking, security hardening, sandboxing techniques, and kernel-level performance tuning.
- Comfort working across infrastructure layers, from kernel modules to orchestration frameworks (e.g., Kubernetes).
- Strong debugging skills and the ability to make performance/security tradeoffs in production systems.
- Familiarity with LLM agents and agent frameworks (e.g., OpenHands, Agent2Agent, MCP).
- Experience running secure workloads in multi-tenant or untrusted environments (e.g., FaaS, CI sandboxes, remote notebooks).
- Exposure to snapshotting and restore techniques (e.g., CRIU, VM snapshots, overlayfs).
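The qualification list above centers on one capability: running untrusted code under kernel-enforced limits. A hedged sketch of just the resource-limit layer, using POSIX rlimits on a child process (a real sandbox would add namespaces, cgroups, seccomp filters, and filesystem isolation via something like gVisor or Firecracker):

```python
import resource
import subprocess
import sys

def run_sandboxed(code, cpu_seconds=2, mem_bytes=1024 ** 3):
    """Run untrusted Python source in a child process with hard resource limits.

    Only the rlimit layer of a sandbox is shown; limits are enforced by the
    kernel on the child, so a runaway loop is killed when it exceeds its CPU cap.
    """
    def apply_limits():
        # Executed in the child between fork and exec.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=10,  # wall-clock backstop on top of the CPU limit
    )

result = run_sandboxed("print(sum(range(10)))")
print(result.stdout.strip())  # 45
```

rlimits alone do not provide isolation (the child still shares the filesystem and network), which is why the role pairs them with the container and microVM technologies listed above.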
Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. You'll also receive benefits including, but not limited to: Comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend.
Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the locations of San Francisco, New York, Seattle is:
$156,000-$225,600 USD
PLEASE NOTE: Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants.
About Us:
At Scale, we believe that the transition from traditional software to AI is one of the most important shifts of our time. Our mission is to make that happen faster across every industry, and our team is transforming how organizations build and deploy AI. Our products power the world's most advanced LLMs, generative models, and computer vision models. We are trusted by generative AI companies such as OpenAI, Meta, and Microsoft, government agencies like the U.S. Army and U.S. Air Force, and enterprises including GM and Accenture. We are expanding our team to accelerate the development of AI applications.
We believe that everyone should be able to bring their whole selves to work, which is why we are proud to be an inclusive and equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability status, gender identity or Veteran status.
We are committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities. If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us at Please see the United States Department of Labor's Know Your Rights poster for additional information.
We comply with the United States Department of Labor's Pay Transparency provision.
PLEASE NOTE: We collect, retain and use personal data for our professional business purposes, including notifying you of job opportunities that may be of interest and sharing with our affiliates. We limit the personal data we collect to that which we believe is appropriate and necessary to manage applicants' needs, provide our services, and comply with applicable laws. Any information we collect in connection with your application will be treated in accordance with our internal policies and programs designed to protect personal data. Please see our privacy policy for additional information.
AI Infrastructure Engineer - Autonomy
Posted 22 days ago
Job Description
Applied Intuition is the vehicle intelligence company that accelerates the global adoption of safe, AI-driven machines. Founded in 2017, Applied Intuition delivers the toolchain, Vehicle OS, and autonomy stacks to help customers build intelligent vehicles and shorten time to market. Eighteen of the top 20 global automakers and major programs across the Department of Defense trust Applied Intuition's solutions to deliver vehicle intelligence. Applied Intuition services the automotive, defense, trucking, construction, mining, and agriculture industries and is headquartered in Mountain View, CA, with offices in Washington, D.C., San Diego, CA, Ft. Walton Beach, FL, Ann Arbor, MI, London, Stuttgart, Munich, Stockholm, Seoul, and Tokyo. Learn more at appliedintuition.com.
We are an in-office company, and our expectation is that employees primarily work from their Applied Intuition office 5 days a week. However, we also recognize the importance of flexibility and trust our employees to manage their schedules responsibly. This may include occasional remote work, starting the day with morning meetings from home before heading to the office, or leaving earlier when needed to accommodate family commitments. (Note: For EpiSci job openings, fully remote work will be considered by exception.)
About the role
We are looking for both infrastructure engineers with expertise in machine learning pipelines and ML engineers that want to work beyond modeling to join our AI Infrastructure group. This role will work across the entire AI lifecycle (dataset generation, training frameworks, compute, evaluation, and deployment) and work directly with modeling teams. This team is a good fit if you are excited to work on broad, ambiguous problems and develop across the entire ML stack. At Applied Intuition, we encourage all engineers to take ownership over technical and product decisions, closely interact with external and internal users to collect feedback, and contribute to a thoughtful, dynamic team culture.
At Applied Intuition, you will:
- Design and build training, inference, and evaluation infrastructure to support our current autonomy stack development, orchestrating massive GPU clusters to process petabytes of multimodal sensor data
- Optimize multimodal data ingestion and preprocessing pipelines (LiDAR, camera, radar, map priors) to support cutting-edge perception and planning model development
- Work across cloud environments to support high-throughput distributed training
- Collaborate closely with the AI research team and autonomy teams
- Technologies: PyTorch, CUDA, Ray, Flyte, K8s
Qualifications:
- Experience building software components that address production, full-stack machine learning challenges.
- Opinions about building a company-wide platform for ML training, evaluation, and deployment
- Knowledge of the open source landscape with judgment on when to choose open source versus build in-house
- Excellent analytical and problem-solving skills
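The high-throughput distributed training mentioned above rests on data parallelism: each worker computes gradients on its own data shard, then the workers average them (an all-reduce) so every replica applies the identical update. A dependency-free sketch of that math (real stacks run this via PyTorch DDP over NCCL; the toy model and data here are invented):

```python
def local_gradient(w, shard):
    # d/dw of mean((w*x - y)^2) over this worker's shard of (x, y) pairs.
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def all_reduce_mean(values):
    # Stand-in for an NCCL all-reduce: every worker ends up with the mean.
    return sum(values) / len(values)

def train_step(w, shards, lr=0.1):
    grads = [local_gradient(w, shard) for shard in shards]  # in parallel on real hardware
    g = all_reduce_mean(grads)  # synchronize so all replicas stay identical
    return w - lr * g

# Two workers, data drawn from y = 3x; start at w = 0.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = train_step(w, shards)
print(round(w, 3))  # converges to 3.0
```

Because the averaged gradient is identical on every worker, data parallelism scales throughput with worker count while keeping the model mathematically equivalent to single-device training on the combined data.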
Compensation at Applied Intuition for eligible roles includes base salary, equity, and benefits. Base salary is a single component of the total compensation package, which may also include equity in the form of options and/or restricted stock units, comprehensive health, dental, vision, life and disability insurance coverage, 401k retirement benefits with employer match, learning and wellness stipends, and paid time off. Note that benefits are subject to change and may vary based on jurisdiction of employment.
Applied Intuition pay ranges reflect the minimum and maximum intended target base salary for new hire salaries for the position. The actual base salary offered to a successful candidate will additionally be influenced by a variety of factors including experience, credentials & certifications, educational attainment, skill level requirements, interview performance, and the level and scope of the position.
Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the location listed is: $153,000 - $222,000 USD annually.
Don't meet every single requirement? If you're excited about this role but your past experience doesn't align perfectly with every qualification in the job description, we encourage you to apply anyway. You may be just the right candidate for this or other roles.
Applied Intuition is an equal opportunity employer and federal contractor or subcontractor. Consequently, the parties agree that, as applicable, they will abide by the requirements of 41 CFR 60-1.4(a), 41 CFR 60-300.5(a) and 41 CFR 60-741.5(a) and that these laws are incorporated herein by reference. These regulations prohibit discrimination against qualified individuals based on their status as protected veterans or individuals with disabilities, and prohibit discrimination against all individuals based on their race, color, religion, sex, sexual orientation, gender identity or national origin. These regulations require that covered prime contractors and subcontractors take affirmative action to employ and advance in employment individuals without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, protected veteran status or disability. The parties also agree that, as applicable, they will abide by the requirements of Executive Order 13496 (29 CFR Part 471, Appendix A to Subpart A), relating to the notice of employee rights under federal labor laws.
AI Infrastructure Engineer - PlayerZero
Posted 22 days ago
Job Description
A stealth-stage AI infrastructure company is building a self-healing system for software that automates defect resolution and development. The platform is used by engineering and support teams to:
- Autonomously debug problems in production software
- Fix issues directly in the codebase
- Prevent recurring issues through intelligent root-cause automation
We believe that as software development accelerates, the burden of maintaining quality and reliability shifts heavily onto engineering and support teams. This challenge creates a rare opportunity to reimagine how software is supported and sustained, with AI-powered systems that respond autonomously.
About the Role
We're looking for an experienced backend/infrastructure engineer who thrives at the intersection of systems and AI - and who loves turning research prototypes into rock-solid production services. You'll design and scale the core backend that powers our AI inference stack - from ingestion pipelines and feature stores to GPU orchestration and vector search.
If you care deeply about performance, correctness, observability, and fast iteration, you'll fit right in.
What You'll Do
- Own mission-critical services end-to-end - from architecture and design reviews to deployment, observability, and service-level objectives.
- Scale LLM-driven systems: build RAG pipelines, vector indexes, and evaluation frameworks handling billions of events per day.
- Design data-heavy backends: streaming ETL, columnar storage, time-series analytics - all fueling the self-healing loop.
- Optimize for cost and latency across compute types (CPUs, GPUs, serverless); profile hot paths and squeeze out milliseconds.
- Drive reliability: implement automated testing, chaos engineering, and progressive rollout strategies for new models.
- Work cross-functionally with ML researchers, product engineers, and real customers to build infrastructure that actually matters.
You might be a great fit if you:
- Have 2-5+ years of experience building scalable backend or infra systems in production environments
- Bring a builder mindset - you like owning projects end-to-end and thinking deeply about data, scale, and maintainability
- Have transitioned ML or data-heavy prototypes to production, balancing speed and robustness
- Are comfortable with data engineering workflows: parsing, transforming, indexing, and querying structured or unstructured data
- Have some exposure to search infrastructure or LLM-backed systems (e.g., document retrieval, RAG, semantic search)
- Experience with vector databases (e.g., pgvector, Pinecone, Weaviate) or inverted-index search (e.g., Elasticsearch, Lucene)
- Hands-on with GPU orchestration (Kubernetes, Ray, KServe) or model-parallel inference tuning
- Familiarity with Go / Rust (primary stack), with some TypeScript for light full-stack tasks
- Deep knowledge of observability tooling (OpenTelemetry, Grafana, Datadog) and profiling distributed systems
- Contributions to open-source ML or systems infrastructure projects
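The retrieval items above (vector databases, semantic search, RAG) reduce to nearest-neighbor lookup over embeddings. A toy brute-force version makes the core operation concrete (document ids and vectors are made up; production systems use approximate indexes like HNSW in pgvector or Elasticsearch rather than a full scan):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, index, k=2):
    """Return the k document ids most similar to the query embedding."""
    scored = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = [
    ("error-log", [0.9, 0.1, 0.0]),
    ("stack-trace", [0.8, 0.2, 0.1]),
    ("release-note", [0.0, 0.1, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))  # ['error-log', 'stack-trace']
```

In a RAG pipeline the query vector comes from embedding the user's question, and the returned documents are stuffed into the LLM prompt; the full scan here is O(n) per query, which is exactly what ANN indexes exist to avoid.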
AI Infrastructure Architect
Posted 12 days ago
Job Description
Get to know Okta
Okta is The World's Identity Company. We free everyone to safely use any technology, anywhere, on any device or app. Our flexible and neutral products, Okta Platform and Auth0 Platform, provide secure access, authentication, and automation, placing identity at the core of business security and growth.
At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box - we're looking for lifelong learners and people who can make us better with their unique experiences.
Join our team! We're building a world where Identity belongs to you.
About the Role
We are looking for a smart and versatile AI Infrastructure Architect to build and evolve the AI infrastructure and platform that powers our identity security solutions. Your work will enable internal teams and product groups to integrate AI capabilities safely, securely, and at scale, empowering Okta's mission to protect millions of digital identities worldwide. While your primary focus will be to architect scalable, secure, and resilient infrastructure supporting AI-driven tools, frameworks, and identity services, we value someone who isn't afraid to get hands-on when needed to help solve complex challenges and drive projects forward.
Key Responsibilities:
- Lead AI enablement initiatives, including proof-of-concepts for emerging AI infrastructure technologies and integration approaches.
- Collaborate cross-functionally with engineering, security, data science, and product teams to align AI platform architecture with business and security goals.
- Architect scalable, resilient, and secure AI infrastructure that supports AI-powered tools and features across Okta's Identity Platform.
- Lead infrastructure decisions across AWS, GCP, or hybrid environments with a focus on secure identity data handling
- Develop and maintain infrastructure-as-code frameworks (e.g., Terraform, Helm) to ensure consistent, reproducible deployment of AI services
- Champion security and compliance by embedding data privacy and identity protection standards directly into the AI platform and infrastructure design.
- Serve as the key advocate and strategist for AI-driven efficiency initiatives across infrastructure platform teams and pre-production systems.
- Implement robust MLOps practices, such as model evaluation, rollback strategies, and A/B testing, to guarantee the reliability and governance of AI in production.
- Drive continuous innovation by staying current with AI and cloud infrastructure trends and evangelizing best practices internally.
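One common mechanism behind the A/B-testing and rollback bullet above is deterministic, hash-based traffic splitting between model versions, so each user consistently sees the same variant across sessions. A sketch under invented variant names and weights (not Okta's actual rollout system):

```python
import hashlib

def route_model(user_id, weights):
    """Deterministically route a user to a model variant by hashed bucket.

    weights: mapping of variant name -> share of traffic (shares sum to 1.0).
    Hashing the user id (not random choice) keeps experiment assignment stable.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant, share in weights.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return variant  # guard against float rounding at the top boundary

weights = {"model-v1": 0.9, "model-v2": 0.1}
counts = {"model-v1": 0, "model-v2": 0}
for i in range(1000):
    counts[route_model(f"user-{i}", weights)] += 1
print(counts)  # roughly a 900/100 split
```

Rollback then amounts to setting a variant's weight to zero, and because assignment is a pure function of the user id and weights, the split is reproducible for offline evaluation.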
Required
- 10+ years in infrastructure or software engineering, with 2 years building AI/ML systems
- Exceptional systems-level thinking and a track record of architecting and building enterprise-grade infrastructure
- Deep expertise in cloud platforms (AWS, GCP), distributed systems, and container orchestration (Kubernetes)
- Expected to be very hands-on in order to create, review, and contribute large chunks of quality code
Preferred
- Experience in identity, security, fraud, or risk analytics domains.
- Experience operationalizing large language models or foundation models in production environments.
- Contributions to MLOps or infrastructure open-source projects.
What You'll Gain
- Opportunity to lead infrastructure shaping AI systems that protect millions of identity transactions.
- Be at the core of building efficient, AI-powered, enterprise-grade solutions that touch internal and external customers alike.
Below is the annual base salary range for candidates located in California. Your actual base salary will depend on factors such as your skills, qualifications, experience, and work location. In addition, Okta offers equity (where applicable), bonus, and benefits, including health, dental and vision insurance, 401(k), flexible spending account, and paid leave (including PTO and parental leave) in accordance with our applicable plans and policies. To learn more about our Total Rewards program please visit:
The annual base salary range for this position for candidates located in the San Francisco Bay area is between: $263,000 - $395,000 USD
What you can look forward to as a Full-Time Okta employee!
- Amazing Benefits
- Making Social Impact
- Developing Talent and Fostering Connection + Community at Okta
Okta cultivates a dynamic work environment, providing the best tools, technology and benefits to empower our employees to work productively in a setting that best and uniquely suits their needs. Each organization is unique in the degree of flexibility and mobility in which they work so that all employees are enabled to be their most creative and successful versions of themselves, regardless of where they live. Find your place at Okta today!
Some roles may require travel to one of our office locations for in-person onboarding.
Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and convictions records, consistent with applicable laws.
If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding please use this Form to request an accommodation.
Okta is committed to complying with applicable data privacy and security laws and regulations. For more information, please see our Privacy Policy at
AI Infrastructure Specialist
Posted today
Job Description
Responsibilities:
- Design, build, and manage scalable and efficient cloud-based (AWS, Azure, GCP) and on-premise infrastructure for AI/ML model training and deployment.
- Optimize hardware and software configurations for AI workloads, including GPU clusters, high-performance computing (HPC) environments, and distributed storage solutions.
- Implement and maintain MLOps pipelines, CI/CD processes for AI models, and automated deployment strategies.
- Monitor system performance, identify bottlenecks, and implement solutions to enhance efficiency and reduce latency.
- Ensure the security and compliance of AI infrastructure, implementing best practices for data protection and access control.
- Collaborate with data scientists, ML engineers, and software developers to understand their infrastructure needs and provide tailored solutions.
- Stay abreast of the latest advancements in AI infrastructure technologies, tools, and methodologies.
- Develop and maintain comprehensive documentation for infrastructure architecture, configurations, and operational procedures.
- Troubleshoot complex infrastructure issues, providing timely resolution to minimize downtime.
- Manage containerization technologies (Docker, Kubernetes) for AI model deployment and orchestration.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Proven experience in designing, deploying, and managing large-scale infrastructure, with a focus on AI/ML workloads.
- Deep understanding of cloud platforms (AWS, Azure, GCP) and their services relevant to AI/ML.
- Expertise in infrastructure as code (IaC) tools such as Terraform or Ansible.
- Proficiency in containerization and orchestration technologies (Docker, Kubernetes).
- Strong knowledge of networking principles, storage solutions, and security best practices.
- Experience with MLOps tools and frameworks (e.g., Kubeflow, MLflow).
- Proficiency in scripting languages like Python or Bash.
- Excellent analytical and problem-solving skills.
- Strong communication and collaboration abilities, with the capacity to explain complex technical concepts to diverse audiences.
AI Infrastructure Abstraction Engineer
Posted 2 days ago
Job Description
Meet the Team
We are an innovation team on a mission to transform how enterprises harness AI. Operating with the agility of a startup and the focus of an incubator, we're building a tight-knit group of AI and infrastructure experts driven by bold ideas and a shared goal: to rethink systems from the ground up and deliver breakthrough solutions that redefine what's possible - faster, leaner, and smarter.
We thrive in a fast-paced, experimentation-rich environment where new technologies aren't just welcome - they're expected. Here, you'll work side-by-side with seasoned engineers, architects, and thinkers to craft the kind of iconic products that can reshape industries and unlock entirely new models of operation for the enterprise.
If you're energized by the challenge of solving hard problems, love working at the edge of what's possible, and want to help shape the future of AI infrastructure - we'd love to meet you.
Your Impact
As an AI Infrastructure Abstraction Engineer, you will help shape the next generation of AI compute platforms by designing systems that abstract away hardware complexity and expose logical, scalable, and secure interfaces for AI workloads. Your work will enable multi-tenancy, resource isolation, and dynamic scheduling of GPUs and accelerators at scale - making infrastructure programmable, elastic, and developer-friendly.
You will bridge the gap between raw compute resources and AI/ML frameworks, allowing infrastructure teams and model developers to consume shared GPU resources with the performance and reliability of bare metal, but with the flexibility of cloud-native systems. Your contributions will empower internal and external users to run AI workloads securely, efficiently, and predictably - regardless of the underlying hardware topology.
This role is critical to enabling AI infrastructure that is multi-tenant by design, scalable in practice, and abstracted for portability across diverse platforms.
KEY RESPONSIBILITIES
- Design and implement infrastructure abstractions that cleanly separate logical compute units (vGPUs, GPU pods, AI queues) from physical hardware (nodes, devices, interconnects).
- Develop runtime services, APIs, and control planes to expose GPU and accelerator resources to users and frameworks with multi-tenant isolation and QoS guarantees.
- Architect systems for secure GPU sharing, including time-slicing, memory partitioning, and namespace isolation across tenants or jobs.
- Collaborate with platform, orchestration, and scheduling teams to map logical resources to physical devices based on utilization, priority, and topology.
- Define and enforce resource usage policies, including fair sharing, quota management, and oversubscription strategies.
- Integrate with model training and serving frameworks (e.g., PyTorch, TensorFlow, Triton) to ensure smooth and predictable resource consumption.
- Build observability and telemetry pipelines to trace logical-to-physical mappings, usage patterns, and performance anomalies.
- Partner with infrastructure security teams to ensure secure onboarding, access control, and workload isolation in shared environments.
- Support internal developers in adopting abstraction APIs, ensuring high performance while abstracting away low-level details.
- Contribute to the evolution of internal compute platform architecture, with a focus on abstraction, modularity, and scalability.
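The core of the responsibilities above - mapping logical GPU requests onto physical devices under per-tenant quotas - can be sketched in a few lines. Everything here (class name, greedy placement policy, quota model) is invented for illustration; a production control plane would also handle topology, preemption, and concurrency:

```python
# Toy sketch of a logical-to-physical GPU allocator with per-tenant quotas.

class GpuPool:
    def __init__(self, nodes, tenant_quota):
        # nodes: {node_name: free_gpu_count}
        self.nodes = dict(nodes)
        self.tenant_quota = tenant_quota
        self.tenant_usage = {}

    def allocate(self, tenant, count):
        used = self.tenant_usage.get(tenant, 0)
        if used + count > self.tenant_quota:
            raise RuntimeError(f"{tenant} over quota")
        # Greedy placement: pick the node with the most free GPUs,
        # which spreads load across the cluster.
        node = max(self.nodes, key=self.nodes.get)
        if self.nodes[node] < count:
            raise RuntimeError("no node can satisfy request")
        self.nodes[node] -= count
        self.tenant_usage[tenant] = used + count
        return {"tenant": tenant, "node": node, "gpus": count}

pool = GpuPool({"node-a": 8, "node-b": 4}, tenant_quota=6)
print(pool.allocate("team-x", 4))  # lands on node-a, which has the most free GPUs
```

The point of the abstraction is that callers reason only in terms of tenants and logical GPU counts; node names and device topology stay behind the allocator's interface.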
Minimum Qualifications:
- Bachelor's degree + 15 years of related experience, or Master's + 12 years of related experience, or PhD + 8 years of related experience
- Experience building scalable, production-grade infrastructure components or control planes using Go, Python, and C++.
- Experience with Kubernetes, Docker, or KubeVirt for virtualization, containerization, and orchestration frameworks.
- Experience designing or implementing logical resource abstractions for compute, storage, or networking, with a focus on multi-tenant environments.
- Experience integrating with AI/ML platforms or pipelines (e.g., PyTorch, TensorFlow, Triton Inference Server, MLflow).
Preferred Qualifications:
- Experience with GPU sharing, scheduling, or isolation techniques (e.g., MPS, MIG, time-slicing, device plugin frameworks, or vGPU technologies).
- Solid grasp of resource management concepts including quotas, fairness, prioritization, and elasticity.
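The fairness and prioritization concepts above can be illustrated with weighted proportional sharing - a toy model of time-slicing a device among tenants. This is only a sketch; real GPU sharing mechanisms such as MPS or MIG partition hardware quite differently:

```python
# Sketch: split a fixed budget of time slices among tenants in
# proportion to their weights (weighted fair sharing).

def fair_shares(weights, total_slices=100):
    """weights: {tenant: weight}. Returns integer slice counts summing to total."""
    total_weight = sum(weights.values())
    shares = {t: (w * total_slices) // total_weight for t, w in weights.items()}
    # Hand leftover slices (lost to integer truncation) to the heaviest tenants.
    leftover = total_slices - sum(shares.values())
    for t in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[t] += 1
    return shares

print(fair_shares({"research": 3, "serving": 1}))  # -> {'research': 75, 'serving': 25}
```

Quota enforcement and oversubscription are then policy layers on top of the same arithmetic: a tenant's weight caps its steady-state share, while idle slices can be temporarily lent out.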
#WeAreCisco where every individual brings their unique skills and perspectives together to pursue our purpose of powering an inclusive future for all.
Our passion is connection - we celebrate our employees' diverse set of backgrounds and focus on unlocking potential. Cisconians often experience one company, many careers where learning and development are encouraged and supported at every stage. Our technology, tools, and culture pioneered hybrid work trends, allowing all to not only give their best, but be their best.
We understand our outstanding opportunity to bring communities together and at the heart of that is our people. One-third of Cisconians collaborate in our 30 employee resource organizations, called Inclusive Communities, to connect, foster belonging, learn to be informed allies, and make a difference. Dedicated paid time off to volunteer - 80 hours each year - allows us to give back to causes we are passionate about, and nearly 86% do!
Our purpose, driven by our people, is what makes us the worldwide leader in technology that powers the internet. Helping our customers reimagine their applications, secure their enterprise, transform their infrastructure, and meet their sustainability goals is what we do best. We ensure that every step we take is a step towards a more inclusive future for all. Take your next step and be you, with us!
AI Infrastructure Support Engineer
Posted 14 days ago
Job Description
AI Infrastructure Support Engineer
At BNY, our culture allows us to run our company better and enables employees' growth and success. As a leading global financial services company at the heart of the global financial system, we influence nearly 20% of the world's investible assets. Every day, our teams harness cutting-edge AI and breakthrough technologies to collaborate with clients, driving transformative solutions that redefine industries and uplift communities worldwide.
Recognized as a top destination for innovators and champions of inclusion, BNY is where bold ideas meet advanced technology and exceptional talent. Together, we power the future of finance - and this is what #LifeAtBNY is all about. Join us and be part of something extraordinary.
We're seeking a future team member for the role of AI Infrastructure Support Engineer to join our AIHUB team. This role is located in Pittsburgh, PA (4 days in office required).
In this role, you'll make an impact in the following ways:
- Provide operational support and issue resolution for AI infrastructure and platform services.
- Manage AI deployments and monitor workloads on distributed systems using Kubernetes and Docker.
- Develop and maintain scripts and automation using Python and Shell scripting.
- Extensively use infrastructure ticketing systems such as JIRA and ServiceNow.
- Collaborate with cross-functional teams and be well versed in DevOps practices including CI/CD (GitLab), containerization (Docker), and secure access control (Azure AD).
- Assist with documenting infrastructure processes, runbooks, and architectural components.
- Troubleshoot networking and storage-related issues across AI workloads.
- Participate in periodic on-call rotations, including occasional weekend coverage.
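The monitoring and troubleshooting duties above often start with a triage pass over workload state. A hypothetical helper (the record shape is a simplified stand-in for what `kubectl get pods -o json` returns, and the restart threshold is invented):

```python
# Sketch: group pods needing attention by failure reason, runbook-style.

from collections import defaultdict

def triage(pods):
    """pods: list of {"name", "phase", "restarts"}. Returns reason -> [names]."""
    issues = defaultdict(list)
    for p in pods:
        if p["phase"] in ("Failed", "Unknown"):
            issues[p["phase"].lower()].append(p["name"])
        elif p["restarts"] > 3:
            issues["crash-looping"].append(p["name"])
    return dict(issues)

pods = [
    {"name": "trainer-0", "phase": "Running", "restarts": 0},
    {"name": "trainer-1", "phase": "Running", "restarts": 7},
    {"name": "serving-2", "phase": "Failed", "restarts": 1},
]
print(triage(pods))  # -> {'crash-looping': ['trainer-1'], 'failed': ['serving-2']}
```

Grouping by reason first means an on-call engineer can jump straight to the matching runbook section instead of walking pods one by one.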
To be successful in this role, we're seeking the following:
- Bachelor's degree in Computer Science or a related discipline, or equivalent work experience, required. 6-8 years of application or infrastructure-related experience required; experience in the securities or financial services industry is a plus
- 4+ years of experience in infrastructure engineering, DevOps, or site reliability roles.
- Hands-on experience with Kubernetes, Docker, HELM, and GitLab CI/CD.
- Proficiency in Python and Bash/Shell scripting for automation and operational tooling.
- Familiarity with enterprise tools such as JIRA, ServiceNow, and Azure AD.
- Working knowledge of networking fundamentals and storage systems, especially as they apply to distributed AI systems.
- Strong documentation and collaboration skills in a cross-functional, fast-paced environment.
- Willingness to participate in weekend or after-hours support rotations as needed.
At BNY, our culture speaks for itself. Check out the latest BNY news at:
BNY Newsroom
BNY LinkedIn
Here are a few of our recent awards:
- America's Most Innovative Companies, Fortune, 2025
- World's Most Admired Companies, Fortune, 2025
- "Most Just Companies", Just Capital and CNBC, 2025
Our Benefits and Rewards:
BNY offers highly competitive compensation, benefits, and wellbeing programs rooted in a strong culture of excellence and our pay-for-performance philosophy. We provide access to flexible global resources and tools for your life's journey. Focus on your health, foster your personal resilience, and reach your financial goals as a valued member of our team, along with generous paid leaves, including paid volunteer time, that can support you and your family through moments that matter.
BNY is an Equal Employment Opportunity/Affirmative Action Employer - Underrepresented racial and ethnic groups/Females/Individuals with Disabilities/Protected Veterans.