Engineer builds 18B frankenmerge that outperforms 35B Qwen
Kyle Hessling merged two Qwen 3.5 finetunes, distilled from Claude Opus and GLM‑5.1, into an 18B model that, after a QLoRA heal, passed 40 of 44 tests and ran in 9.2 GB of VRAM.
Kyle Hessling, an AI infrastructure engineer, combined two Qwen 3.5 finetunes, one distilled from Claude Opus and one from GLM‑5.1, into an 18‑billion‑parameter model. After a targeted QLoRA “heal” fine‑tune, the merged model passed 40 of 44 capability tests and outperformed Alibaba’s Qwen 3.6‑35B‑A3B MoE in the reported comparison while running in 9.2 GB of VRAM.
Hessling built the model by stacking layers from two Qwen 3.5 finetunes. Layers 0–31 come from Qwopus 3.5‑9B‑v3.5, a distillation of Claude 4.6 Opus into Qwen. Layers 32–63 come from Qwen 3.5‑9B‑GLM5.1‑Distill‑v1, trained with reasoning signals from GLM‑5.1. He used a passthrough frankenmerge method that places one model’s layers on top of another without blending or averaging weights.
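A passthrough merge of this kind can be sketched in plain Python. The model names and the 0–31 / 32–63 split follow the article; the assumption that each 9B donor contributes 32 layers, and the dict-of-layers stand-in for real checkpoint tensors, are illustrative and not Hessling's actual script.

```python
# Illustrative passthrough frankenmerge: stack the layers of one donor
# model after the layers of another, renumbering the second stack.
# No weights are blended or averaged -- layers are copied as-is.

def passthrough_merge(bottom, top):
    """Return a merged layer map: all of `bottom`, then all of `top`
    shifted up by len(bottom)."""
    n = len(bottom)
    merged = dict(bottom)
    for i, layer in top.items():
        merged[n + i] = layer
    return merged

# Stand-in "checkpoints": 32 layers each, tagged by source model
# (layer counts are an assumption consistent with the 18B total).
qwopus = {i: f"Qwopus-3.5-9B-v3.5/layer{i}" for i in range(32)}
glm = {i: f"Qwen3.5-9B-GLM5.1-Distill-v1/layer{i}" for i in range(32)}

merged = passthrough_merge(qwopus, glm)
print(len(merged))   # 64 layers in the merged stack
print(merged[0])     # bottom of the stack: the Qwopus distill
print(merged[63])    # top of the stack: the GLM-5.1 distill
```

Because nothing is averaged, the seam at layer 32 is exactly the kind of discontinuity the heal step described below has to repair.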
Existing merge tools did not support Qwen 3.5’s hybrid linear/full attention architecture, so Hessling wrote a custom merge script. The initial raw merge produced garbled code output because the layer boundary disrupted attention and projection behavior. To correct that, he ran a “heal” fine‑tune: a QLoRA intervention that adds a small set of learned low‑rank parameters targeting the attention and projection layers. After roughly 1,000 heal steps the model stabilized and reached the reported benchmark scores.
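The core of a LoRA-style heal is that the merged weights stay frozen while two small trained matrices supply a low-rank correction, W′ = W + (α/r)·B·A. The sketch below shows that update on a toy 2×2 projection weight with a rank-1 adapter; the shapes, values, and hyperparameters are made up for illustration and are not Hessling's recipe (QLoRA additionally quantizes the frozen base weights, which is omitted here).

```python
# LoRA update applied to a frozen weight matrix W:
#   W' = W + (alpha / r) * B @ A
# A is (r x d_in), B is (d_out x r); only A and B are trained.

def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def lora_heal(W, A, B, alpha, r):
    """Return the healed weight matrix W + (alpha/r) * B @ A."""
    delta = matmul(B, A)
    return [[W[i][j] + (alpha / r) * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# Toy 2x2 projection weight with a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.2]]            # 1 x 2
B = [[1.0], [0.0]]          # 2 x 1
print(lora_heal(W, A, B, alpha=2, r=1))
```

Targeting only the attention and projection matrices keeps the number of trained parameters tiny relative to the 18B frozen base, which is why roughly 1,000 steps can suffice to smooth the layer boundary.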
The merged model was quantized to Q4_K_M and requires about 9.2 GB of VRAM for inference. Hessling noted that an NVIDIA RTX 3060 is theoretically sufficient to run the model at that quantization. By comparison, the Qwen 3.6‑35B MoE evaluated in the same comparison requires about 22 GB of VRAM.
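The memory figures follow from simple bits-per-weight arithmetic. The estimator below is generic; the ~4.1 and ~5.0 effective bits/weight used here are assumptions chosen to land near the reported numbers, not published specs of the quantization formats, and the estimate ignores KV cache and runtime overhead.

```python
# Rough weight-memory estimate for a quantized model:
#   bytes = n_params * bits_per_weight / 8
# Reported in decimal gigabytes; KV cache and overhead are ignored.

def weight_memory_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

# 18B merge at ~4.1 effective bits/weight (assumed) -> ~9.2 GB
print(round(weight_memory_gb(18e9, 4.1), 1))
# 35B MoE at ~5.0 effective bits/weight (assumed) -> ~21.9 GB
print(round(weight_memory_gb(35e9, 5.0), 1))
```

The gap between the two estimates is the practical point: the merge fits in a 12 GB consumer card's budget, while the 35B comparison model does not.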
User trials on consumer machines showed mixed results. On an M1 MacBook running an MLX quantized build, a zero‑shot prompt produced a long internal reasoning chain that hit token limits without producing a final working result. A simple prompt to write a Snake game took more than 40 minutes in one test run. Both stacked finetunes are reasoning‑focused distills, which likely explains why some prompts produced extended internal chains of thought.
Hessling published the merge scripts, training notes and model files in a public repository. The original finetune author mirrored the repository, and the combined release recorded more than three thousand downloads within two weeks of availability.
The project used model distillation, targeted layer stacking and aggressive quantization to reduce memory requirements while retaining reasoning behavior. The code, model weights and heal recipe are available for other developers to run and modify.