NVIDIA Exclusive Interview | A Key Blackwell GPU Design Leader Explains AI Factories and the Future of the Semiconductor Market - Insights | Molayo

Video: NVIDIA Exclusive Interview | A Key Blackwell GPU Design Leader Explains AI Factories and the Future of the Semiconductor Market
Channel: 안될공학 - IT 테크 신기술
Duration: 19m 42s
Source: subtitle (auto, ko)
Transcript:
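So fundamentally, Vera is the CPU for the age of agentic AI and multiple...
Hello everyone, great to see you. This is 에러 (Error). I'm at GTC Taipei right now, and I got the chance to meet one of NVIDIA's key people in person. With the Rubin GPU launch approaching, I met the person who led the launch of Blackwell, currently regarded as the most powerful AI chip ever. That person is Director Shar Narasimhan, who heads product marketing for AI and data center GPUs at NVIDIA.
NVIDIA has gone beyond being a simple fabless semiconductor company and is reshaping AI-related infrastructure worldwide. So from here we will look, one by one, at what NVIDIA's real strategy is and at the secrets of the infrastructure and chipsets that will change the future of the AI era.
What did this GTC Taipei keynote emphasize? In the past the question was simply how much faster the GPU was, but the market's question has changed. With the arrival of what NVIDIA calls the AI factory, where AI processes massive amounts of data on its own and creates value, customers have started to focus on strictly commercial metrics such as cost per generated token and efficiency per watt. I asked about this directly.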
...AI factory not as [inaudible] infrastructure that creates value. So from a market perspective, customers seem to be looking beyond faster GPUs and asking how efficiently they can produce and monetize tokens. So from a data center GPU perspective, how are customers changing the way they evaluate AI factory investments, and should we now think less in terms of peak performance and more in terms of cost per token, tokens per watt, and GPU utilization?
So the most important metric that customers look at is how they minimize the token cost. They want the lowest token cost possible. Now, in Jensen's keynote, he actually gave a chart where he showed you the relationship between how quickly you can get your AI factory stood up and how many tokens it can actually generate, which directly corresponds to your revenue. Because right now, anyone who can create a token can automatically sell a token, because there is so much demand for AI tokens. And you want to minimize your token cost because that maximizes your profit. And then the last thing you look at is: what is the longevity of that AI factory? Right now we see even cloud instances of Hopper GPUs in very high demand.
So the thing that customers look at most is minimizing their token cost, but right behind that are the considerations for how quickly they can stand up the data center or the AI factory, and following our reference designs enables them to do that. How many tokens can they produce? Because that directly translates into their revenues. And since we have the most performant platform, we can actually generate the most tokens and maximize their revenues and their profits. And also, because we constantly maintain and speed up our infrastructure with software improvements, we allow you to extend the lifetime and lifespan of that AI factory. So on all of these fronts, NVIDIA is really focused on maximizing what the customers are able to receive.
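
The metrics named in this exchange (cost per token, tokens per watt, and profit) can be related with simple arithmetic. The following is a minimal sketch, not anything shown in the interview: the `FactoryScenario` class, its field names, and every number are hypothetical and chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FactoryScenario:
    """Hypothetical inputs; none of these numbers come from the interview."""
    tokens_per_second: float         # sustained token output of the whole factory
    power_watts: float               # average power draw
    cost_per_hour: float             # amortized capital plus operating cost per hour
    price_per_million_tokens: float  # selling price of one million tokens


def cost_per_million_tokens(s: FactoryScenario) -> float:
    """Hourly cost divided by hourly token output, scaled to one million tokens."""
    tokens_per_hour = s.tokens_per_second * 3600
    return s.cost_per_hour / tokens_per_hour * 1_000_000


def tokens_per_joule(s: FactoryScenario) -> float:
    """'Tokens per watt' as a rate: tokens per second per watt equals tokens per joule."""
    return s.tokens_per_second / s.power_watts


def hourly_profit(s: FactoryScenario) -> float:
    """Revenue from selling every generated token minus the hourly cost."""
    tokens_per_hour = s.tokens_per_second * 3600
    revenue = tokens_per_hour / 1_000_000 * s.price_per_million_tokens
    return revenue - s.cost_per_hour


# Made-up baseline factory.
example = FactoryScenario(
    tokens_per_second=1_000_000,
    power_watts=2_000_000,
    cost_per_hour=1800.0,
    price_per_million_tokens=2.0,
)

# Same hardware and cost, but a hypothetical software update raises throughput by 50%.
after_software_update = replace(
    example, tokens_per_second=example.tokens_per_second * 1.5
)

for name, scenario in [("baseline", example), ("after update", after_software_update)]:
    print(
        name,
        round(cost_per_million_tokens(scenario), 4),
        round(tokens_per_joule(scenario), 4),
        round(hourly_profit(scenario), 2),
    )
```

In this toy model, a throughput gain at unchanged cost lowers the cost per token and raises profit at the same time, which is the link the answer draws between software improvements and the useful lifetime of an AI factory.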
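AI technology is now moving beyond simply answering instructions and evolving into agentic AI, which plans on its own, executes code, and produces results. What this really means is that AI computation is no longer a one-shot exchange of a question and an answer; it becomes a very long and complex workflow. That is the key point. Naturally, the way data moves among GPUs, CPUs, and the network inside the data center has to change completely, and this keynote clearly emphasized that. So I asked in more detail about NVIDIA's technical answer to how a platform for the agentic AI era should stay balanced compared with existing AI training hardware clusters.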
Agentic AI inference is no longer just token generation; it is becoming a long workflow involving planning, retrieval, tool use, code execution, etc. So how does this workload shift change data center GPU platform design compared with traditional training-heavy GPU clusters, and where does the balance shift most across GPU compute, CPU orchestration, memory bandwidth, and networking?
We have always designed our GPUs to be very high performance for both training and inference. You may recall that we were the first platform provider to actually pioneer FP4, and specifically NVFP4. NVFP4 is not just four-bit floating point. It is an entire format that includes tensors and scaling factors, allowing you to compress your FP8 parameters and values down to FP4 and minimize your storage use. So we actually look at the entire microarchitecture. Where are the areas that we can keep accelerating?
And every time we identify a new bottleneck, you know, we're following Amdahl's law throughout the entire AI factory. As you accelerate one portion of the data center, you end up finding a bottleneck somewhere else. So for example, we accelerated GPUs to the point where we found that they needed more data from storage and from other parts of the data center. As a result, we went into networking with the Mellanox acquisition. Subsequent to that, we saw CPUs were slowing our GPUs down because they became the next bottleneck. And that's why we started building our own CPUs, and now we have the best CPUs when it comes to agentic AI workloads. So as we look across the entire AI factory, we identify where the next bottleneck is that we can accelerate, because we look at it from a holistic standpoint of the entire workflow.
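
The description of NVFP4 above as 4-bit values plus scaling factors can be illustrated with a block-scaled quantizer. This is a minimal sketch under stated assumptions, not NVIDIA's implementation: the E2M1 value grid and the 16-value block size follow public descriptions of NVFP4 rather than anything said in the interview, the function names are hypothetical, and the scale is kept as an ordinary Python float instead of being stored in an 8-bit format.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one scaling factor shared by 16 values


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from the E2M1 grid."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values by multiplying each code by the block scale."""
    return [c * scale for c in codes]


def quantize_tensor(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


def bits_per_value(block_size=BLOCK_SIZE, value_bits=4, scale_bits=8):
    """Storage per value when one scale of `scale_bits` is shared by a block."""
    return value_bits + scale_bits / block_size


weights = [0.1 * i for i in range(-8, 8)]  # made-up data, one block of 16 values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
print(scale, codes)
print(max(abs(a - b) for a, b in zip(weights, restored)))
print(bits_per_value())
```

Under these assumptions each value costs 4 bits plus a share of its block's scale, compared with 8 bits for FP8, which is where the storage saving mentioned in the answer would come from. The per-block scale is what keeps the coarse 4-bit grid usable across blocks with very different magnitudes.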
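As the agentic AI era opens, the role of the CPU is unexpectedly back in the spotlight. AI now controls complex tasks on its own, runs Python code, and coordinates hardware, so the compute burden on the CPU side has risen sharply, which is why it matters. In response, NVIDIA strongly emphasized the Vera CPU, a new processor optimized for agentic AI, both in Jensen's keynote and in Jensen's press Q&A. I asked about the Vera CPU in more detail.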
NVIDIA introduced Vera as built for AI agents. In agentic AI, CPU-side workloads such as Python runtimes, orchestration logic, and code execution are becoming much more important. So from a GPU platform perspective, what is the core role of the Vera CPU? Is Vera mainly designed to keep GPUs better utilized, or should we see it as a new kind of data center CPU built to run agentic workflows more efficiently?
So fundamentally, Vera is the CPU for the age of agentic AI, and it has multiple very key innovations. It has basically redesigned the CPU from the ground up for these agentic workflows. It has very high single-threaded, single-core performance. It has massive memory bandwidth to load and move data around. And it has a scalable coherent fabric that allows you to move data and instructions across the entire die of the CPU. There is no chiplet tax when it comes to moving information around with Vera, because you're not crossing die-to-die boundaries. Vera is designed specifically for this: how do we get data to the core that's going to complete the calculation, how do we complete that very quickly, and then how do we send that response back over to the GPU?
That's one aspect of Vera. The other aspect is exactly what you just touched on. It's also an orchestration CPU, if you will, because we have NVLink-C2C. That was another innovation that we introduced. We saw that to keep the GPUs fed with data in a timely manner and avoid the CPU becoming a bottleneck, this aspect of C2C...
Raising semiconductor performance is no longer possible by simply building a single chip. So in the next-generation Vera Rubin platform, the GPU, CPU, interconnect, and even software are designed together so that the whole system works like one giant computer. This is literally what is called extreme co-design, and it seems to have reached its peak here. I also asked what the core innovation of the Vera Rubin platform is for resolving the many bottlenecks that arise during agentic computation.
The next question is about Vera Rubin. Vera Rubin seems to be more than just a GPU generation; it looks like an AI factory platform co-designed across GPU, CPU, networking, and software. In agentic AI, a single request can expand into multiple stages of reasoning, retrieval, tool use, and things like that, making [inaudible] more important now. What are the biggest technical contributors to Vera Rubin's performance gains, and is the biggest difference coming from the Rubin GPU architecture itself, or the Vera CPU integration, NVLink fabric, memory hierarchy, or networking? All of them?
It's actually all of them. So you alluded to extreme co-design. We look at the entire workflow and identify where all the bottlenecks are that are going to start appearing. NVIDIA is fortunate to have our own in-house research team when it comes to AI models. So we release our own open source models. You may be familiar with Nemotron 3 and Nemotron 3 Ultra. So we have a research team that is actively involved in developing the state of the art and in where these large language models are going. As a result of that, we can see what needs are coming up on the horizon.
When it came to needing to move data quickly across the entire AI factory, we needed better networking, and that's why we started coming up with our own product line there. When it came to these agentic workflows, they are going to have multiple calls. In fact, an agentic loop can have 160-plus turns, which is the same as you making 160 rapid prompt inquiries to a model. To be able to hit that, you need to have very rapid GPUs to do all of those computations. You need very fast CPUs that can actually do validation for each of the inference cycles that the GPU is making. So by looking across this entire data center, this entire AI factory, we looked at Amdahl's law, and we looked at what is coming in terms of the workload. How can we identify what we need to remove? And then how do we create a product that is highly optimized to improve performance and remove that bottleneck?
So, you know, when competitors say, "Well, we have a bigger rack," they're not really competing against the NVL72. They are actually competing against the entire Rubin pod. And the productivity and the performance of that pod far outstrip anything else in the market. It's also in proven benchmarks and submitted code repositories.
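
The answer above combines two ideas: an agentic loop of 160-plus turns, and Amdahl's law as the reason a fast GPU alone is not enough. The following is a minimal sketch of that reasoning. Only the turn count comes from the interview; the per-turn times, the 4x factor, and the strictly sequential GPU-then-CPU model are hypothetical simplifications.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the time gets `factor`x faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def loop_seconds(turns: int, gpu_seconds_per_turn: float, cpu_seconds_per_turn: float) -> float:
    """Total time for a loop in which every turn runs GPU inference, then CPU-side work."""
    return turns * (gpu_seconds_per_turn + cpu_seconds_per_turn)


TURNS = 160    # from the interview: an agentic loop can have 160-plus turns
GPU_S = 0.8    # hypothetical GPU inference time per turn, in seconds
CPU_S = 0.2    # hypothetical CPU validation/orchestration time per turn, in seconds
FACTOR = 4.0   # hypothetical acceleration factor

baseline = loop_seconds(TURNS, GPU_S, CPU_S)
gpu_only = loop_seconds(TURNS, GPU_S / FACTOR, CPU_S)
gpu_and_cpu = loop_seconds(TURNS, GPU_S / FACTOR, CPU_S / FACTOR)

print("baseline seconds:", baseline)
print("GPU-only speedup:", baseline / gpu_only)
print("Amdahl prediction:", amdahl_speedup(GPU_S / (GPU_S + CPU_S), FACTOR))
print("GPU+CPU speedup:", baseline / gpu_and_cpu)
```

With these made-up numbers, making only the GPU four times faster speeds up the whole loop by about 2.5x, because the untouched CPU share then accounts for half of each turn. Accelerating both parts recovers the full factor, which is the case the answer makes for designing the CPU, GPU, and network together.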

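Even at this moment, Blackwell is being deployed in huge volumes around the world. At the same time, customers are already thinking about the next generation, Vera Rubin, so enormous CapEx is going in, with more borrowing, rights offerings, and so on. For existing infrastructure investment not to be wasted and to actually make money, the question is this: in the transition from Blackwell to Vera Rubin, as the generation changes, what are the key changes in infrastructure continuity and data center architecture? This seemed like an important point for those who actually buy infrastructure, so I asked about it.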
...AI factories while also having to think about Rubin and future generations. So for them it is critical to understand how much of today's GPU infrastructure investment can carry forward. As customers move from Blackwell to Vera Rubin, what is the most important technical transition from a data center perspective? And which area will be the biggest turning point: GPU, memory, or networking fabric?
So the really nice thing about working with the NVIDIA platform is that we always maintain backwards compatibility. CUDA has preserved that for over a decade. The GPU that you buy today will be compatible with any code that you developed for a prior generation. Going from Blackwell to Vera Rubin, we are maintaining that NVL72 Oberon rack architecture. So customers will have a seamless transition if they don't want to change anything else about their AI factory. They can simply move the Vera Rubin rack into the place of the Blackwell rack that used to be there. There's a minor change in the amount of power that's required for the Vera Rubin rack, the new NVL72. But in terms of form factor, deployment, and software compatibility, all of those things are backwards compa...
Now, I couldn't skip asking about memory. In the end, memory bandwidth, which moves data quickly back and forth, matters as much as the GPU's compute speed, so I asked about HBM.
From a data center GPU perspective, what is the biggest pressure that long-context inference and agentic workloads place on HBM? Will future inference performance be driven more by raw compute, or by architectures that make better use of HBM and other memory movement?
So one of the nice things about practicing this philosophy of extreme co-design is that we can actually look at how we can introduce innovation and optimize our needs to what is available from the ecosystem and production capacity as a whole. So I'll draw your attention to the Vera CPU. It actually uses LPDDR, and when we first brought that innovation about, you know, it kind of changed the mentality in the industry. We bring about low-power memory, but we also introduced built-in error correction. So you now have enterprise-grade memory. It's lower cost, and it allows us to bring innovation into a very specific aspect of memory. HBM is going to continue to be important. It sits right next to the GPU and continues [inaudible] that we can look across that and...
In this keynote, CEO Jensen even described NVIDIA as truly the strongest networking company. In the end, agentic AI has to remember very long contexts, and if it keeps attempting complex tasks, it is hard to fit everything into DRAM. So the BlueField-4-based STX, described as accelerated storage infrastructure, has NAND flash in it. In other words, storage in the data center seems to be presented not just as a repository but as a kind of real-time extended memory for long contexts, so I asked about this as well.
So for agentic AI to perform long and complex tasks, the model cannot rely only on the prompt; it needs to continuously read long documents, past context, tool outputs, and enterprise data. So NVIDIA has introduced the BlueField-4 STX as accelerated storage infrastructure for long-context reasoning. So from a [inaudible] perspective, will storage evolve from simply storing data to acting more like context memory for agentic AI?
As the agents' queries get their tool calling and their requests, and the prompt sizes get bigger, they will need more access to fast memory. That's one of the reasons why we introduced the STX rack. Having the BlueField-4 allows us to do policy management and security, protecting both the data and what agents are doing when they are in the data center. Jensen announced our new DOCA security product at his keynote yesterday, and bringing about these innovations allows us to [inaudible] accessible agents [inaudible] latency...
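Lastly, there was previously talk of Rubin CPX, a system with GDDR that handles the prefill side of the prefill and decode split, but lately that talk has faded and there has been a direct switch to LPX. Some say CPX won't come out at all anymore, and there is a lot of speculation, but what matters is that I found it really astonishing that such a huge company makes this kind of decision with such agility. So I asked someone who actually does the hands-on work, hoping to get a hint about NVIDIA's culture.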
People often thought training was the hard part while inference was relatively easy, as Jensen said. But in the age of agentic AI, inference is no longer just simple token generation; it has become a complex systems problem involving planning, code execution, etc. My real question is this: the Rubin CPX was originally understood as a dedicated accelerator for the context or prefill phase, but in the latest Vera Rubin announcement, LPX seems to have taken a much more central role. So should we read this as a sign that NVIDIA is not following a fixed chip roadmap but quickly reconfiguring CPUs, GPUs, networking, LPUs, memory, and interconnects around shifting agentic AI bottlenecks to optimize the total cost of the AI factory? And what is the core capability that allows NVIDIA to make this kind of system-level architecture pivot so quickly?
I think it is part of our culture. We have a very open and transparent culture internally. Teams share a lot of information quickly. We are blessed to have our own research on AI models. If you look at today, we're an AI company, but at one point in our history we were basically known as a chip company. We're the only company in the semiconductor space that actually creates its own foundation models. Having that know-how in house, and having this open culture that allows us to share information, allows us to see where the trends are going, and then teams can come together very quickly and focus on what is next on the roadmap.
We are extremely transparent about what we're doing next. You see our roadmaps; they're public information. We show them in our keynotes over a year in advance. You know what we're thinking of. But when it comes down to the details of that design and how you optimize so that the entire workflow is completely seamless, the reason NVIDIA is able to make these rapid pivots is just the culture of the company. All these teams are working in concert with each other, and that's why we can make these changes in such a rapid manner when we see shifts in the market.
Your explanation was so great.
This has been 에러 (Error).
