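Long post, but I hope it helps someone. I'm running a llama.cpp Vulkan server on a single AMD R9700 GPU. The settings below gave great results with a long prompt, supplied by ChatGPT, that generates a test website. After that I ran a prompt to generate a full Playwright test suite.

While generating the tests I barely had to steer it, other than telling it to fix one failing test at a time. The website worked completely on the first run.

I think I'm done tuning and testing models until the next big release, and can get back to coding now...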

llama-cpp  | ========== LLAMA.CPP STARTUP COMMAND ==========
llama-cpp  | /app/llama-server -m /models/Qwen3.6-35B-A3B-UD-Q5_K_XL/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --ctx-size 262144 --threads 8 --threads-batch 8 --gpu-layers 99 --parallel 1 --flash-attn on --batch-size 2048 --ubatch-size 1024 --cache-type-k q8_0 --cache-type-v q8_0 --cache-ram 12000 --ctx-checkpoints 50 --mmap --no-mmproj --kv-unified --reasoning off --reasoning-budget 0 --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0

The sampling settings are taken from the "precise coding" section of https://huggingface.co/Qwen/Qwen3.6-35B-A3B
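For anyone who wants to send the same sampling values per request instead of relying on the server defaults, here is a minimal sketch that only builds the JSON body for the OpenAI-compatible chat endpoint. The helper name `buildChatRequest` and the example message are hypothetical, and it assumes the server honors `top_k`, `min_p`, and `repeat_penalty` when they appear in the request body, which is worth checking against your llama.cpp build.

```typescript
// Sampling values copied from the startup command above.
interface SamplingSettings {
  temperature: number;
  top_k: number;
  top_p: number;
  min_p: number;
  repeat_penalty: number;
  presence_penalty: number;
}

const PRECISE_CODING: SamplingSettings = {
  temperature: 0.6,
  top_k: 20,
  top_p: 0.95,
  min_p: 0.0,
  repeat_penalty: 1.0,
  presence_penalty: 0.0,
};

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: builds the body for POST /v1/chat/completions.
// Assumption: the server accepts these sampling fields per request.
function buildChatRequest(
  model: string,
  messages: ChatMessage[],
  sampling: SamplingSettings = PRECISE_CODING,
) {
  return { model, messages, stream: false, ...sampling };
}

const requestBody = buildChatRequest("Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf", [
  { role: "user", content: "Write a TypeScript function that reverses a string." },
]);

console.log(JSON.stringify(requestBody, null, 2));
```

The resulting object can be serialized with `JSON.stringify` and posted with any HTTP client. Keeping the values in one constant makes it easy to compare against the command line if results drift.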

VS Code chatLanguageModels.json:

{
    "name": "Sean Llama.cpp",
    "vendor": "customoai",
    "apiKey": "${input:chat.lm.secret.3c0c0f21}",
    "models": [
        {
            "id": "Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
            "name": "Qwen3.6-35B",
            "url": "https://llm.home.arpa/v1/chat/completions",
            "toolCalling": true,
            "vision": false,
            "maxInputTokens": 180000,
            "maxOutputTokens": 10000,
            "family": "Qwen3",
            "inputTokenCost": 0.0001,
            "outputTokenCost": 0.0001,
            "temperature": 0.6,
            "top_p": 0.95,
            "top_k": 20,
            "repeat_penalty": 1,
            "presence_penalty": 0,
            "frequency_penalty": 0,
            "systemMessage": "You are a precise coding assistant. Avoid repeating plans. Execute tasks directly. Do not restate intentions multiple times.",
            "timeout": 600000,
            "retry": {
                "enabled": true,
                "max_attempts": 2,
                "interval_ms": 1500
            }
        }
    ]
}

ChatGPT-generated test prompt:

You are working in a clean Vite + React + TypeScript project.

Your task is to build a medium-complexity local-first website called "Bike Shop Service Tracker."

The app should help a small bike shop track incoming bike repair jobs. It should run entirely in the browser using localStorage. Do not use a backend, database, authentication, external API, router, Tailwind, shadcn, Redux, Zustand, or any complex setup.

Use only:

React
TypeScript
plain CSS or CSS modules
browser localStorage
lucide-react only if already installed

Before implementing, inspect the existing project structure briefly. Then create a concise implementation plan. After the plan, implement the app directly. Do not get stuck repeating the plan.

Functional requirements:

Main layout

Create a polished single-page dashboard.
Header with app name: "Bike Shop Service Tracker."
A summary area showing:
- total open jobs
- jobs due today
- overdue jobs
- completed jobs
Main content split into:
- job creation/edit form
- job list and filters

Repair job data model
Each repair job should include:

id
customerName
customerPhone
bikeDescription
serviceType
priority: low, normal, high, urgent
status: intake, waiting-parts, in-progress, ready, completed
dueDate
notes
createdAt
updatedAt

Seed data

If localStorage is empty, create 6 realistic sample repair jobs.
Include different statuses, priorities, due dates, and service types.
Make at least one job overdue and one job due today.

Job form

Allow creating a new job.
Allow editing an existing job.
Include basic validation:
- customer name required
- phone required
- bike description required
- service type required
- due date required
Show clear validation messages.
Include buttons:
- Save Job
- Cancel Edit, when editing
- Clear Form

Job list

Display jobs as cards or table rows.
Each job should show:
- customer name
- bike description
- service type
- priority
- status
- due date
- notes preview
Include actions:
- edit
- mark completed
- delete

Filtering and sorting
Add controls for:

search by customer name, phone, bike description, service type, or notes
filter by status
filter by priority
sort by due date
sort by priority
sort by created date

UX behavior

Persist all jobs to localStorage.
Keep the UI responsive and usable on desktop and narrow screens.
Use clear visual differences for priority and status.
Completed jobs should look visually quieter than active jobs.
Overdue jobs should be visually obvious.
Due-today jobs should be highlighted differently from overdue jobs.
Deleting a job should ask for confirmation.

Code organization
Create a reasonable file structure, for example:

src/App.tsx
src/App.css
src/types.ts
src/data/seedJobs.ts
src/utils/storage.ts
src/utils/jobFilters.ts
src/components/JobForm.tsx
src/components/JobList.tsx
src/components/JobCard.tsx
src/components/SummaryCards.tsx
src/components/FilterBar.tsx

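Use this structure only if it helps. Do not over-engineer.

Quality requirements

TypeScript should compile cleanly.
Avoid any unless there is a strong reason.
Keep functions readable.
Avoid giant components where possible.
The app should run with:
npm run dev

Final response
When finished, summarize:

files created or changed
how to run the app
features implemented
limitations or follow-up improvements

Important behavior instructions:

Do not only create TODOs.
Do not repeatedly say "now I need to implement this."
Do not read the same files repeatedly unless necessary.
Once you have enough context, start editing files.
Prefer completing a working first version over designing a perfect architecture.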
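
To give a feel for what the prompt above asks for, here is a minimal sketch of the repair job data model with a few of the filter and sort helpers. This is not the code the model generated. The helper names, the ISO date-string format, the rule that completed jobs are never overdue, and the sample data are all my own assumptions for illustration.

```typescript
type Priority = "low" | "normal" | "high" | "urgent";
type Status = "intake" | "waiting-parts" | "in-progress" | "ready" | "completed";

interface RepairJob {
  id: string;
  customerName: string;
  customerPhone: string;
  bikeDescription: string;
  serviceType: string;
  priority: Priority;
  status: Status;
  dueDate: string; // assumed format: "YYYY-MM-DD", which sorts lexicographically
  notes: string;
  createdAt: string;
  updatedAt: string;
}

// Higher rank means more urgent.
const PRIORITY_RANK: Record<Priority, number> = { low: 0, normal: 1, high: 2, urgent: 3 };

// Assumption: completed jobs are never counted as overdue or due today.
function isOverdue(job: RepairJob, today: string): boolean {
  return job.status !== "completed" && job.dueDate < today;
}

function isDueToday(job: RepairJob, today: string): boolean {
  return job.status !== "completed" && job.dueDate === today;
}

// Case-insensitive search across the text fields listed in the prompt.
function matchesSearch(job: RepairJob, query: string): boolean {
  const q = query.trim().toLowerCase();
  if (q === "") return true;
  return [job.customerName, job.customerPhone, job.bikeDescription, job.serviceType, job.notes]
    .some((field) => field.toLowerCase().includes(q));
}

// Returns a new array, most urgent first, without mutating the input.
function sortByPriority(jobs: RepairJob[]): RepairJob[] {
  return [...jobs].sort((a, b) => PRIORITY_RANK[b.priority] - PRIORITY_RANK[a.priority]);
}

// Hypothetical sample data, for illustration only.
function makeJob(overrides: Partial<RepairJob>): RepairJob {
  return {
    id: "job-0",
    customerName: "Sample Customer",
    customerPhone: "555-0100",
    bikeDescription: "Blue road bike",
    serviceType: "Tune-up",
    priority: "normal",
    status: "intake",
    dueDate: "2025-06-15",
    notes: "",
    createdAt: "2025-06-01T09:00:00Z",
    updatedAt: "2025-06-01T09:00:00Z",
    ...overrides,
  };
}

const today = "2025-06-15";
const jobs: RepairJob[] = [
  makeJob({ id: "job-1", priority: "low", dueDate: "2025-06-10" }),
  makeJob({ id: "job-2", priority: "urgent", serviceType: "Brake bleed" }),
  makeJob({ id: "job-3", priority: "high", status: "completed", dueDate: "2025-06-01" }),
];

console.log(jobs.filter((job) => isOverdue(job, today)).map((job) => job.id));
console.log(sortByPriority(jobs).map((job) => job.id));
```

Passing `today` in as a parameter rather than reading the clock inside the helpers keeps them pure, which also makes them easy to cover in unit tests.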

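I fed the llama.cpp logs back into the local Qwen (a bit ironic), and after a large, very successful Copilot/autopilot session it produced this summary of its own performance:

Prompt: Analyze the llama.cpp logs and tell me how the performance looks: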

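Based on the provided logs, here is a detailed performance analysis of the llama.cpp instance.

Executive Summary

Overall health: Good/healthy. The server is running normally and handling requests without crashes or hangs.
Throughput: Moderate. It reports roughly 94–105 tokens/sec (t/s) for generation. This is typical of a mid-range consumer GPU or efficient CPU inference, depending on the specific hardware and model (GGUF quantization).
Prompt processing: Variable. This is the biggest bottleneck. Some prompts are processed quickly thanks to cache hits, while others force a full re-evaluation and cause high latency spikes.
Memory usage: Efficient but growing. VRAM/RAM usage scales with context length (~723 MiB for up to 4 prompts) and stays within the 12 GB limit.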

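Detailed Analysis

1. Generation speed (eval time)

This is the rate at which the model generates new text tokens.

Performance: ~93–105 tokens/sec
Log evidence:
- Task 460: 104.58 tokens/sec
- Task 776: 102.35 tokens/sec
- Task 2848: 101.73 tokens/sec
- Task 3079: 94.12 tokens/sec
- Task 4726: 93.38 tokens/sec
Analysis: This is consistent and stable. The slowdown as context grows (from 104 t/s to 93 t/s) is expected, because memory-bandwidth pressure and attention-computation overhead increase as the context window fills. This is a healthy slope.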

2. Prompt processing (prefill time)

This is the time taken to process the input prompt. It varies widely in the logs.

Scenario A: cache hit (fast)
- Task 2848: 856.58 ms for 2,098 tokens (~2,449 t/s).
- Reason: The log shows restored context checkpoint. The system found a sufficiently similar previous state and could resume from it without reprocessing.
- Performance: Excellent.
Scenario B: cache miss / full re-evaluation (slow)
- Task 460: 68.01 ms for 27 tokens (fast because the prompt is small).
- Task 776: 1944.26 ms for 5,783 tokens (~2,974 t/s).
- Task 3079: 17307.06 ms for 38,012 tokens (~2,196 t/s).
- Task 4726: 1153.57 ms for 1,652 tokens (~1,432 t/s).
- Reason: The log explicitly shows forcing full prompt re-processing due to lack of cache data.
- Analysis:
  - Task 3079 is the most concerning. It spent 17.3 seconds reading the prompt before generating a single word. This is most likely because the prompt was very long (38k tokens) and a cache miss occurred.
  - Task 4726 was slower than Task 2848 despite having a similar context size. Most likely it did not find a perfect cache match (sim_best = 0.958 is good but may not be enough to skip the full prefill, and restoring the checkpoint may have added overhead).
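
The prefill rates quoted above follow directly from tokens divided by elapsed time. A small sketch to re-derive them from the token counts and timings in the list. The helper name `tokensPerSecond` is my own, and the numbers are copied from the analysis rather than re-read from the raw log.

```typescript
// Convert a token count and an elapsed time in milliseconds to tokens/sec.
function tokensPerSecond(tokens: number, elapsedMs: number): number {
  return tokens / (elapsedMs / 1000);
}

// Prefill figures copied from the analysis above.
const prefillRuns = [
  { task: 2848, tokens: 2098, ms: 856.58 },
  { task: 776, tokens: 5783, ms: 1944.26 },
  { task: 3079, tokens: 38012, ms: 17307.06 },
  { task: 4726, tokens: 1652, ms: 1153.57 },
];

for (const run of prefillRuns) {
  const rate = tokensPerSecond(run.tokens, run.ms);
  console.log(`task ${run.task}: ${rate.toFixed(0)} t/s`);
}
// task 2848: 2449 t/s
// task 776: 2974 t/s
// task 3079: 2196 t/s
// task 4726: 1432 t/s
```

The rates match the figures in the summary, so the 17.3 second prefill for Task 3079 comes from the sheer prompt length at a normal rate, not from a slowdown in processing speed.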

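3. Context management and caching

Cache mechanism: Using n_ctx_slot = 262144 (256k context window) and n_keep = 0.
Checkpoints: The system is actively creating checkpoints (created context checkpoint X of 50).
- Each checkpoint is ~62.8 MiB.
- Task 3079 created 6 checkpoints during processing.
Issue found: The log message forcing full prompt re-processing appears frequently when switching between very different context lengths or when the similarity threshold is not met.
- Task 776 (5,783 tokens) had n_past = 23 from the previous slot. It checked the checkpoints but failed to use them, forcing full reprocessing. This suggests the previous context (Task 460, 27 tokens) was too different to be useful for Task 776.
- Task 3079 (38,012 tokens) also failed to use a previous checkpoint. This is likely because the previous context (Task 2848, ~7.8k tokens) was not similar enough, or had position offsets that did not line up with the KV cache.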
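
As a rough sanity check on the checkpoint numbers, here is the arithmetic for how much memory the checkpoints could take if all 50 were in use. It assumes every checkpoint stays near 62.8 MiB. Whether checkpoints are actually charged against the --cache-ram limit is not something the log confirms, so treat the percentage as an upper-bound illustration only.

```typescript
// Values taken from the startup command and the log analysis above.
const CHECKPOINT_MIB = 62.8; // approximate size of one context checkpoint
const MAX_CHECKPOINTS = 50; // --ctx-checkpoints 50
const CACHE_RAM_MIB = 12000; // --cache-ram 12000

// Hypothetical helper: total size of `count` checkpoints of equal size.
function checkpointFootprintMiB(count: number, sizeMiB: number = CHECKPOINT_MIB): number {
  return count * sizeMiB;
}

const worstCase = checkpointFootprintMiB(MAX_CHECKPOINTS);
const shareOfLimit = (worstCase / CACHE_RAM_MIB) * 100;

console.log(`all ${MAX_CHECKPOINTS} checkpoints: ${worstCase.toFixed(1)} MiB`);
console.log(`share of the cache limit: ${shareOfLimit.toFixed(1)}%`);
console.log(`6 checkpoints (Task 3079): ${checkpointFootprintMiB(6).toFixed(1)} MiB`);
// all 50 checkpoints: 3140.0 MiB
// share of the cache limit: 26.2%
// 6 checkpoints (Task 3079): 376.8 MiB
```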

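4. Resource utilization

Memory:
- Total state size reached 723.154 MiB for 4 active prompts.
- The limit is 12,000 MiB.
- Status: Very low memory usage, with plenty of headroom.
Latency:
- Total response time ranges from short prompts (3 s) to long prompt prefill + generation (34.5 s).
- The 34.5-second delay is entirely due to prompt processing of the 38k-token input.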

Excellent results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot
