DragNeXt: Rethinking Drag-Based Image Editing
Yuan Zhou1, Junbao Zhou1, Qingshan Xu1*, Kesen Zhao1, Yuxuan Wang1, Hao Fei2,
Richang Hong3, Hanwang Zhang1
1Nanyang Technological University
2National University of Singapore
3Hefei University of Technology
Abstract
Drag-Based Image Editing (DBIE), which allows users to ma-
nipulate images by directly dragging objects within them, has
recently attracted much attention from the community. How-
ever, it faces two key challenges: (i) point-based drag is often
highly ambiguous and difficult to align with user intentions;
(ii) current DBIE methods primarily rely on alternating be-
tween motion supervision and point tracking, which is not
only cumbersome but also fails to produce high-quality re-
sults. These limitations motivate us to explore DBIE from a
new perspective—unifying it as a Latent Region Optimiza-
tion (LRO) problem that aims to use region-level geometric
transformations to optimize latent code to realize drag manip-
ulation. Thus, by specifying the areas and types of geomet-
ric transformations, we can effectively address the ambiguity
issue. We also propose a simple yet effective editing frame-
work, dubbed DragNeXt. It solves LRO through Progressive
Backward Self-Intervention (PBSI), simplifying the overall
procedure of the alternating workflow while further enhancing
quality by fully leveraging region-level structure information
and progressive guidance from intermediate drag states. We
validate DragNeXt on our NextBench, and extensive experi-
ments demonstrate that our proposed method can significantly
outperform existing approaches.
Code —
Extended version —
Introduction
Diffusion models (Rombach et al. 2022; Dhariwal and Nichol
2021) have made remarkable progress in the field of text-to-
image generation, serving as foundational models for a wide
range of generative tasks, such as image super-resolution
(Wu et al. 2024; Sun et al. 2024), style transfer (Zhang et al.
2023; Chung, Hyun, and Heo 2024), text-based image edit-
ing (Brooks, Holynski, and Efros 2023; Hertz et al. 2022),
and few-shot learning (Zhou et al. 2024; Giannone, Nielsen, and
Winther 2022; Tan, Chen, and Yan 2023). Nevertheless, an
inherent limitation of diffusion models lies in their poor con-
trollability, which brings more challenges to fine-grained
editing tasks, especially those that require interactive-style
*Corresponding author.
Copyright © 2026, Association for the Advancement of Artificial
Intelligence (). All rights reserved.
Figure 1: Examples of the key issues in current DBIE. (i)
Text prompts used in ClipDrag (Jiang, Wang, and Chen 2025)
remain insufficient for solving the ambiguity issue; (ii) prede-
fined mapping functions employed by FastDrag (Zhao et al.
2024) and RegionDrag (Lu, Li, and Han 2024) boost effi-
ciency but severely compromise editing quality. The numbers
given in the upper-left corner of images indicate the latency
for dragging the regions of handle points to target positions.
manipulation (Liu et al. 2024; Zhao et al. 2025; Zhou et al.
2025a).
Recent studies (Shi et al. 2024; Cui et al. 2024) have ex-
plored the use of diffusion models for Drag-Based Image
Editing (DBIE), which enables users to manipulate images
by directly dragging objects via a set of user-specified handle
and target points. Existing diffusion-based DBIE methods
predominantly employ a point-based alternating optimization
strategy (Liu et al. 2024; Cui et al. 2024; Ling et al. 2024;
Hou et al. 2024; Choi et al. 2024; Shi et al. 2024), where
Step-1: optimizing the features of handle points toward corre-
sponding target positions by performing point motion super-
vision; Step-2: updating handle point positions iteratively via
KNN-based point tracking.
However, the point-based alternating workflow inevitably
brings two issues to DBIE: (i) point-based drag suffers from
high ambiguity and struggles to align with users’ intentions,
thereby severely compromising the precision of the editing
process; (ii) tackling DBIE through an alternating procedure
of motion supervision and point tracking is not only cumber-
some but also fails to always yield high-quality results, as
accurately estimating the updated positions of handle points
in each drag iteration is both challenging and time-consuming
(Ling et al. 2024; Cui et al. 2024). Besides, given that point-
based motion supervision offers only limited structural cues
about visual scenes, it cannot effectively guide DBIE.
Recently, ClipDrag (Jiang, Wang, and Chen 2025) sought
to mitigate ambiguity by incorporating constraints from text
prompts. Nonetheless, as a form of high-level descriptions,
texts are often too vague to provide control signals required
by fine-grained image manipulation (Shi et al. 2024; Zhang,
Rao, and Agrawala 2023). For example, as shown in Figure 1,
even with the guidance of the prompt “rotate the cat’s head
around its left cheek as the central point”, ClipDrag still fails
to achieve the desired outcome. To boost DBIE’s efficiency,
FastDrag (Zhao et al. 2024) and RegionDrag (Lu, Li, and Han
2024) proposed using predefined mapping functions, rather
than the learnable alternating paradigm. Unfortunately, the warpage function and the copy-and-paste strategy
used in FastDrag and RegionDrag are not flexible enough
to handle all editing tasks and are prone to yielding unreal-
istic or unnatural results—such as the distorted cat’s head,
deformed handbell, and visible artifacts in the edited areas
shown in Figure 1—thus severely degrading image quality.
Point-based motion supervision and tracking are cum-
bersome and often difficult to align with users’ intentions,
whereas relying solely on the warpage function or the copy-
and-paste strategy is far from delivering high-quality results.
These observations naturally lead us to ask two questions:
→Q1. Is there a more effective solution to the ambiguity issue? →Q2. How can we enhance the efficiency of DBIE approaches based on alternating point motion supervision and tracking, while further improving their editing quality?
These two questions motivate us to revisit DBIE from a
new perspective—unifying it as a Latent Region Optimiza-
tion (LRO) problem, which aims to leverage region-level
geometric transformations to optimize latent embeddings
and realize drag manipulation. Therefore, by specifying the
regions and types of geometric transformations, we can ef-
fectively address Q1. Furthermore, we design a simple-yet-
effective editing framework, DragNeXt, to tackle Q2. For
efficiency, it unifies DBIE as LRO and thus eliminates the
necessity of conducting handle point tracking by upgrad-
ing point motion supervision to region-level optimization of
latent embeddings. For editing quality, we propose a Progres-
sive Backward Self-Intervention (PBSI) strategy that solves
LRO by fully leveraging region-level self-intervention from
intermediate drag states. By bypassing point tracking and
considering region-level guidance from intermediate states, it
can achieve a better trade-off between efficiency and quality.
Contribution Summary: (i) We propose to unify DBIE as an
LRO problem. Therefore, by specifying the regions and types
of geometric transformations, we can effectively resolve the
ambiguity issue. (ii) We propose a simple yet effective editing
framework, DragNeXt, which tackles DBIE via LRO and further enhances editing quality by performing PBSI. (iii) We introduce NextBench, a benchmark with explicit user-intention annotations for evaluating the alignment between user expectations and edited results. (iv) Extensive experiments on NextBench demonstrate that our DragNeXt achieves a better trade-off between editing efficiency and quality.
Related Work
DragDiffusion (Shi et al. 2024) is the first work using dif-
fusion models to achieve DBIE, which followed (Pan et al.
2023) and conducted motion supervision and point tracking
alternately. Based on (Shi et al. 2024), GoodDrag (Zhang
et al. 2024) further enhanced the fidelity of dragged areas
by rearranging the drag process across multiple denoising
timesteps. DragText (Choi et al. 2024) proposed refining text
embeddings to avoid drag halting. DragonDiffusion (Mou
et al. 2023) and DiffEditor (Mou et al. 2024) discarded the
tracking phase and directly applied point motion supervision
between initial handle points and target points. To estimate
handle point positions more accurately, StableDrag (Cui et al.
2024) proposed a discriminative point tracking strategy, and
FreeDrag (Ling et al. 2024) designed a line search back-
tracking mechanism. EasyDrag (Hou et al. 2024) advanced
(Shi et al. 2024) via introducing a stable motion supervision,
which is beneficial for improving the quality of final results.
FastDrag (Zhao et al. 2024) and RegionDrag (Lu, Li, and
Han 2024) improved the efficiency of DBIE by employing
fixed predefined mapping functions, where (Lu, Li, and Han
2024) is based on copy-and-paste and thus requires users to
specify both handle and target areas. ClipDrag (Jiang, Wang,
and Chen 2025) reduced the ambiguity of DBIE via using
text prompts. DragNoise (Liu et al. 2024) proposed editing
on UNet’s bottleneck features, which inherently contain more
semantic information and can stabilize dragging.
REMARK 1. Our method differs from DragDiffusion, Good-
Drag, DragText, StableDrag, FreeDrag, EasyDrag, and Drag-
Noise fundamentally, since it does not rely on alternating
between point motion supervision and tracking. Instead of
only considering initial relationships between handle and tar-
get points, as in DragonDiffusion and DiffEditor, we fully
leverage progressive region-level guidance from intermediate
drag states. ClipDrag and RegionDrag overlook the ambiguity issue arising from the type of geometric transformations, whereas our approach does not rely on texts to reduce ambiguity. Different from FastDrag and RegionDrag, we only use geometric mapping functions to provide interventional signals, and our learnable backward self-intervention strategy can fully leverage the prior knowledge inherently learned by diffusion models via back-propagated gradients, avoiding the unnatural deformation caused by a fixed transformation pattern.
Methodology
Preliminaries
Diffusion Models. Diffusion models (Ho, Jain, and Abbeel
2020; Rombach et al. 2022; Dhariwal and Nichol 2021)
are composed of a diffusion process and a reverse process.
During the diffusion process, an image $x$ is encoded into latent space as $z_0$ and undergoes a gradual addition of Gaussian noise, $q(z_t|z_0) = \mathcal{N}(\sqrt{\alpha_t}\, z_0,\ (1-\alpha_t)\mathbf{I})$, where $\alpha_t$ is a non-learnable parameter that decreases with the timestep $t$. The reverse process recovers $z_0$ from $z_T$ by training a denoiser $\varepsilon_\Theta(\cdot)$:
$$\mathcal{L}_\Theta = \mathbb{E}_{t\sim U(1,T),\,\varepsilon_t\sim\mathcal{N}(0,\mathbf{I})}\big[\,\|\varepsilon_t - \varepsilon_\Theta(z_t; t, c)\|^2\,\big] \quad (1)$$
where $\varepsilon_t$ denotes the ground-truth noise at timestep $t$, and $c$ represents an extra condition. Following prior works (Liu et al. 2024; Shi et al. 2024; Zhang et al. 2024), we employ DDIM (Song, Meng, and Ermon 2020) in our approach.
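To make the objective in Equation (1) concrete, the following is a minimal PyTorch-style sketch of one Monte-Carlo sample of the denoising loss. It is illustrative only: `eps_theta` (the denoiser), `alpha_bar` (the cumulative noise schedule $\alpha_t$), and the optional condition `c` are placeholders rather than the actual Stable Diffusion components used in our framework.

```python
import torch

def diffusion_loss(eps_theta, z0, alpha_bar, c=None):
    """One Monte-Carlo sample of Eq. (1): || eps_t - eps_theta(z_t; t, c) ||^2."""
    B = z0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)       # t ~ U(1, T)
    eps = torch.randn_like(z0)                             # ground-truth noise eps_t
    a_t = alpha_bar[t].view(B, 1, 1, 1)                    # alpha_t for each sample
    z_t = a_t.sqrt() * z0 + (1.0 - a_t).sqrt() * eps       # q(z_t | z_0)
    return ((eps - eps_theta(z_t, t, c)) ** 2).mean()      # noise-prediction loss
```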
Drag-Based Image Editing. Given $n$ pairs of handle and target points $O = \{h_i = (x_i^h, y_i^h),\ g_i = (x_i^g, y_i^g)\}_{i=1,\dots,n}$,
DBIE aims to edit an image x by dragging objects or regions
indicated by handle points to target ones. Usually, an extra
binary mask M is used to specify the uneditable region of x.
Motion Supervision and Point Tracking. Current DBIE
methods (Shi et al. 2024; Zhang et al. 2024; Choi et al. 2024;
Cui et al. 2024; Ling et al. 2024; Hou et al. 2024; Liu et al.
2024) mainly rely on performing motion supervision and
point tracking alternately, where the former aims to transfer
the features of handle points to target positions while the latter
updates handle points iteratively and prevents dragging halt.
We use $F_{h_i/g_i}(z_t)$ to denote the features extracted by $\varepsilon_\Theta(\cdot)$ at the location $h_i$ or $g_i$. Therefore, the objective function of the motion supervision can be described by Equation (2):
$$\mathcal{L}_m(z_t^k) = \sum_{i=1}^{n} \sum_{q\in\pi(h_i^k)} \big\|F_{q+d_i}(z_t^k) - \mathrm{SG}\big(F_q(z_t^k)\big)\big\|_1 + R_M \quad (2)$$
where $z_t^k$ and $h_i^k$ denote the latent code $z_t$ and the handle point $h_i$ updated for $k$ iterations, $d_i = (g_i - h_i^k)/\|g_i - h_i^k\|_2$ is the normalized vector from $h_i^k$ to $g_i$, $\pi(h_i^k)$ denotes the neighborhood of $h_i^k$, $\mathrm{SG}(\cdot)$ stops gradients from being back-propagated to variables, and $R_M$ is a constraint term that ensures the consistency of uneditable regions. After the motion supervision in each iteration $k$, point tracking is performed:
$$h_i^{k+1} = \arg\min_{q\in\pi(h_i^k)} \big\|F_q(z_t^{k+1}) - F_{h_i}(z_t)\big\|_1 \quad (3)$$
where $F_{h_i}(z_t)$ indicates the features of the initial handle point $h_i$ in the original latent code $z_t$.
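For reference, the alternating workflow of Equations (2) and (3) can be summarized by the PyTorch-style sketch below. It is a simplified illustration of the point-based baselines, not our released code: `feat_fn`, the neighborhood radius `r`, the step size `lr`, and `orig_handle_feats` (the cached features $F_{h_i}(z_t)$ of the original latent) are placeholder assumptions, and the uneditable-region term $R_M$ is omitted.

```python
import torch
import torch.nn.functional as F

def patch_coords(p, r, H, W):
    """Integer (x, y) coordinates inside the (2r+1)^2 neighborhood pi(p) of p = (x, y)."""
    x, y = p
    xs = torch.arange(max(x - r, 0), min(x + r + 1, W))
    ys = torch.arange(max(y - r, 0), min(y + r + 1, H))
    return torch.cartesian_prod(xs, ys)

def bilinear(feats, xy):
    """Bilinearly sample a (C, H, W) feature map at a float (x, y) location."""
    C, H, W = feats.shape
    scale = torch.tensor([W - 1.0, H - 1.0], device=feats.device)
    grid = (xy.to(feats.device).view(1, 1, 1, 2) / scale) * 2.0 - 1.0
    return F.grid_sample(feats[None], grid, align_corners=True).view(C)

def alternating_step(feat_fn, z_t, handles, targets, orig_handle_feats, r=3, lr=0.01):
    """One iteration of point motion supervision (Eq. 2) followed by point tracking (Eq. 3).
    feat_fn maps a latent to (C, H, W) features; handles/targets are lists of (x, y) ints."""
    z_t = z_t.detach().requires_grad_(True)
    feats = feat_fn(z_t)
    _, H, W = feats.shape
    loss = feats.new_zeros(())
    for (hx, hy), (gx, gy) in zip(handles, targets):
        d = torch.tensor([gx - hx, gy - hy], dtype=torch.float32)
        d = d / (d.norm() + 1e-8)                          # normalized drag direction d_i
        for q in patch_coords((hx, hy), r, H, W):
            src = feats[:, q[1], q[0]].detach()            # SG(F_q(z_t^k))
            dst = bilinear(feats, q.float() + d)           # F_{q + d_i}(z_t^k)
            loss = loss + (dst - src).abs().sum()
    loss.backward()
    z_next = (z_t - lr * z_t.grad).detach()                # motion-supervision update of z_t^k
    # Point tracking (Eq. 3): nearest-feature search against the original handle features.
    new_feats = feat_fn(z_next)
    new_handles = []
    for (hx, hy), f0 in zip(handles, orig_handle_feats):
        cands = patch_coords((hx, hy), r, H, W)
        dists = torch.stack([(new_feats[:, q[1], q[0]] - f0).abs().sum() for q in cands])
        best = cands[dists.argmin()]
        new_handles.append((int(best[0]), int(best[1])))
    return z_next, new_handles
```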
REMARK 2. →Why is point tracking critical for motion-based methods? Point-based motion supervision is too local to provide enough guidance for the whole editing procedure. Losing the positions of handle points severely interrupts the drag process, as no alternative guidance for editing remains, thereby significantly damaging the quality of edited images. →What are the limitations of methods based on motion supervision and point tracking? Firstly, although the
use of point tracking can alleviate the inherent limitation of
point-based motion supervision, it is still very challenging
to precisely estimate the updated positions of handle points.
Inaccurate coordinate estimation can significantly mislead
the dragging process, resulting in suboptimal outcomes. Sec-
ondly, the alternating execution of Equation (2) and (3) results
in low efficiency of DBIE, since motion supervision is re-
peatedly disrupted by iterative point tracking. Thirdly, point
motion supervision suffers from high ambiguity and easily
leads to gaps between user expectations and actual results.
Last but not least, while using dense points can reduce ambi-
guity in some situations, this will substantially decrease the
efficiency of the alternating workflow, as shown in Figure 9.
Latent Region Optimization for Reliable
Drag-Based Image Editing
We begin by outlining key factors behind the ambiguity issue
of DBIE in Proposition 1, which can be further summarized
as two key questions: how to drag? and what to drag?

Figure 2: Factor-1 and -2. (a) Factor-1: translation or deformation? (b) Factor-2: move the nose or head?

Figure 3: Rethink DBIE. (a) Handle Factor-1: clarify the drag type. (b) Handle Factor-2: specify the drag region.
Proposition 1 (Key Factors to Ambiguity). The ambiguity of DBIE is twofold: →Factor-1. drag operations inherently involve multiple types—such as translation, deformation, and rotation—and treating them as type-agnostic induces ambiguity about users' intentions (how to drag?); →Factor-2. point indicators are insufficient for accurately specifying objects or regions that need to be dragged (what to drag?).
In Figure 2, we provide an illustration for the two key
factors, Factor-1 and Factor-2. On one hand, the drag opera-
tion in Figure 2(a) is inherently ambiguous since it could be
interpreted as either a translational movement of the cup or a
deformation of its edge region. This ambiguity stems from
uncertainty about the types of drag operations (how to drag?),
which inevitably increases gaps between user expectations
and model behaviors, thus damaging the precision of the
editing process. On the other hand, in Figure 2 (b), the drag
instruction could be either dragging the raccoon’s nose, its
head, or even its whole body. This type of ambiguity arises
from uncertainty about which areas or objects to drag (what
to drag?) since points are too ambiguous to clearly reflect
users’ intentions. How to drag and what to drag are two fun-
damental problems in DBIE. Although textual description
appears to be a shortcut, it actually does not work well as
exemplified in Figure 1. We argue there is No Free Lunch in
resolving these ambiguity issues, which means it is necessary
to enable models to perceive drag operation types and areas
in a more explicit way and design a more effective approach
to guide them toward producing user-intended results.
REMARK 3. →Some previous methods have noticed the
ambiguity issue in DBIE, but few of them consider both
Factor-1 and Factor-2, or provide a systematic and clear
analysis for this problem, which we believe is critical and
valuable for inspiring the further development of DBIE.
Based on the above observations, we introduce step by step how to explore DBIE from a new perspective, i.e., unifying DBIE as a Latent Region Optimization (LRO)
problem. We first rethink the DBIE task in Proposition 2.
Proposition 2 (Rethink DBIE). DBIE can be regarded as
performing geometric transformations on user-specified re-
gions of images.
For instance, Result-1 given in Figure 2 (a) can be seen as applying a deformation transformation to the white coffee cup region. From this perspective, by specifying the regions and types of geometric transformations, we can effectively resolve the ambiguity issue caused by Factor-1 and Factor-2, because both how and what to drag are clarified, as illustrated in Figure 3.

Figure 4: Examples of estimating target regions. (a) Rotation: the handle region is rotated around a user-specified central point $c_i$ determined together with the handle point $h_i$. (b) Translation: the handle region $\vartheta_i$ is mapped to the target region $\rho_i$ by a translation matrix built from the handle point $h_i$ and the target point $g_i$.
Based on Proposition 2, we further give the definition
of our region-level Reliable Drag-based Image Editing
(Reliable DBIE) in Definition 1, which aims to help users obtain reliable editing results and narrow the gap between their
expectations and actual outcomes.
Definition 1 (Reliable DBIE). Reliable DBIE is to manip-
ulate user-specified regions E = {ϑi}i=1,...,n of an image
x based on the corresponding geometric transformations
Γ = {fi}i=1,...,n inferred from instructions given by users.
Currently, the editing process is primarily performed on noise
latent embeddings encoded by diffusion models, as they are
more editable than original images (Mokady et al. 2023; Ruiz
et al. 2023). Thus, we can further extend Definition 1 to Def-
inition 2, and unify DBIE as a Latent Region Optimization
(LRO) problem.
Definition 2 (Unify DBIE as LRO). DBIE can be unified as
optimizing specific target regions P = {ρi}i=1,...,n within
a latent code zt based on user-specified handle regions
E = {ϑi}i=1,...,n and the corresponding geometric transfor-
mations Γ = {fi}i=1,...,n involved in user instructions:
$$z_t^* = \arg\min_{z_t} \mathcal{L}_{LRO}\big(z_t,\ \{\rho_i\}_{i=1,\dots,n}\big), \qquad \{\rho_i\}_{i=1,\dots,n} = \delta(E, \Gamma) \quad (4)$$
where $\mathcal{L}_{LRO}$ is the objective function of LRO, and $\delta(\cdot)$ produces binary masks $\{\rho_i\}_{i=1,\dots,n}$ that identify the target regions required to be optimized in $z_t$ according to $E$ and $\Gamma$.
REMARK 4. →Why LRO? LRO serves as a bridge between
DBIE and region-level geometric transformations. Therefore,
we can leverage many well-studied geometric transformation
functions in computer graphics to realize more reliable DBIE
via explicitly controlling the dragging process. →What can
LRO do? Different from methods based on alternating point
motion supervision and tracking, LRO takes into account
region-level visual information, which provides more robust
guidance for latent code manipulation. Under such regional
supervision, it is unnecessary to excessively focus on posi-
tions of some specific points, as there exists sufficient context
information to guide dragging. →How to estimate target
latent regions? Target latent regions are estimated using geo-
metric transformation functions widely adopted in computer
graphics. As exemplified in Figure 4, if users intend to rotate
the handle region ϑi, the region can be multiplied by a rota-
tion matrix to achieve the desired geometric transformation.
We observe that rotation and translation geometric trans-
formations can cover most DBIE scenarios. We reformu-
late DBIE’s user input: users specify a set of handle regions
E = {ϑi}i=1,...,n for an input image and give the correspond-
ing drag instructions C = {Ti,Oi}i=1,...,n. If the operation
type Ti =“rotation”, Oi = {hi, gi, ci} where hi and gi
denote a pair of a handle point and a target point, and ci
represents a rotation center of ϑi; otherwise, Oi = {hi, gi}.
Also, a binary mask M is adopted to specify the uneditable
region. Based on {hi, gi} or {hi, gi, ci}, the transformation
function fi can be constructed by determining the correspond-
ing rotation and translation matrix. For details on converting
input points to fi, please refer to the supplementary material.
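As a concrete illustration, the following is a minimal Python/NumPy sketch of how $\delta(\cdot)$ could turn a binary handle-region mask and the user points into a target-region mask together with the coordinate mapping used later. The function name `delta`, its `frac` argument (useful for intermediate states), and the mask-based representation are our own illustrative assumptions; the full matrix formulation of the transforms is given in the appendix (Equations (16)-(19)).

```python
import numpy as np

def delta(handle_mask, h, g, c=None, frac=1.0):
    """Map a binary handle-region mask to the target-region mask implied by the user
    points: translation if c is None, otherwise rotation around c. `frac` scales the
    angle/offset (frac < 1 yields the intermediate states used by PBSI)."""
    H, W = handle_mask.shape
    ys, xs = np.nonzero(handle_mask)
    pts = np.stack([xs, ys], axis=1).astype(float)           # (N, 2) region coordinates (x, y)
    h, g = np.asarray(h, float), np.asarray(g, float)
    if c is None:                                            # translation: offset g - h
        new_pts = pts + frac * (g - h)
    else:                                                    # rotation around c by angle(h, c, g)
        c = np.asarray(c, float)
        u, v = h - c, g - c
        theta = frac * np.arctan2(u[0] * v[1] - u[1] * v[0], u @ v)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        new_pts = (pts - c) @ R.T + c
    new_pts = np.rint(new_pts).astype(int)
    new_pts[:, 0] = np.clip(new_pts[:, 0], 0, W - 1)
    new_pts[:, 1] = np.clip(new_pts[:, 1], 0, H - 1)
    target_mask = np.zeros_like(handle_mask)
    target_mask[new_pts[:, 1], new_pts[:, 0]] = 1            # rho_i: target-region mask
    mapping = list(zip(map(tuple, pts.astype(int)), map(tuple, new_pts)))  # handle -> target coords
    return target_mask, mapping
```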
REMARK 5. DBIE via regional geometric transforma-
tions. Object movement can be achieved by translating an
object's entire region; deformation can be realized by trans-
lating only its edge region; 2D rotation can be achieved by
applying a rotation transformation; and 3D rotation can be
interpreted as translating the sub-region of an object, assisted
by prior knowledge inherently learned in diffusion models (as
shown in Figure 7 (j), the car’s 3D rotation can be realized by
translating its front to the right). Explicit geometric functions
and region-level guidance can help achieve better DBIE.
Progressive Backward Self-Intervention: Less Meets
More!
Based on Definition 2, we further design DragNeXt to en-
hance both editing quality and efficiency. As mentioned be-
fore, the alternating workflow lowers the efficiency of DBIE,
while inaccurate handle point tracking easily leads to drag-
ging halt and makes results unsatisfactory. Therefore, Drag-
NeXt addresses DBIE from an LRO perspective, eliminating
the need for KNN-based point tracking by explicitly advanc-
ing point-based motion supervision to region-level optimiza-
tion of latent embeddings. Moreover, it employs a Progressive
Backward Self-Intervention (PBSI) strategy, which does not
require accurately tracking point positions but still achieves
superior editing results by fully leveraging progressive region-
level guidance from intermediate transformation states.
Progressive Backward Self-Intervention. Figure 5 gives
a brief illustration of our approach. Given an input image
x, we first encode it into latent space and perform DDIM
inversion to produce zT . Then, PBSI is conducted from T to
T ′ during denoising with K iterations per timestep. We take
the handle region ϑi at the k-th iteration of the timestep t as
an example to illustrate PBSI. We first extract the features of $z_t^k$ by concatenating the outputs of the last upsample blocks of all stages of $\varepsilon_\Theta(z_t^k)$ and upsampling them to half of the resolution of $x$, denoted as $\mathcal{F}(z_t^k)$. Then, we estimate the intermediate transformation state $\rho_i^{t,k}$ of the handle region $\vartheta_i$ within the extracted features $\mathcal{F}(z_t^k)$ based on the user-given conditions $C$, which can be described by Equation (5):
$$\rho_i^{t,k},\ \Pi_{\vartheta_i\to\rho_i^{t,k}} = \delta(\vartheta_i, f_i^{t,k}), \qquad \delta(\vartheta_i, f_i^{t,k}) = \begin{cases} \mathrm{Rot}(\vartheta_i, c_i, \theta), & \text{if } T_i = \text{``rotation''} \\ \mathrm{Trans}(\vartheta_i, \omega), & \text{otherwise.} \end{cases} \quad (5)$$
Figure 5: A brief illustration of our DragNeXt. The input image is encoded by the VAE encoder and inverted to $z_T$; for $T' < t < T$, PBSI is executed with $K$ iterations per timestep (feature extraction $\mathcal{F}(\cdot)$, intermediate LRO, and backward self-intervention $-\partial\mathcal{L}_{LRO}/\partial z_t^k$); otherwise, plain denoising is performed, and the final latent is decoded by the VAE decoder. For brevity, the drag instruction (rotation, translation, or deformation) is visualized on $x$.
In the equation, $\mathrm{Rot}(\vartheta_i, c_i, \theta)$ rotates the handle region $\vartheta_i$ around the center point $c_i$ by an angle $\theta = \eta_{t,k}\cdot\angle h_i c_i g_i$, $\mathrm{Trans}(\vartheta_i, \omega)$ translates $\vartheta_i$ according to the offset vector $\omega = \eta_{t,k}\cdot(g_i - h_i)$, and $\eta_{t,k} = \frac{K(T-t)+k}{K(T-T'+1)}$ is a weighting factor that determines the angles or offsets of intermediate states. Also, $\rho_i^{t,k}$ is a binary mask that identifies the target intermediate region in $\mathcal{F}(z_t^k)$, and $\Pi_{\vartheta_i\to\rho_i^{t,k}}$ represents the coordinate mapping from the handle region $\vartheta_i$ to the intermediate state $\rho_i^{t,k}$. Finally, we copy and detach the features extracted from the original latent code, $\mathcal{F}'(z_t) = \mathrm{detach}(\mathcal{F}(z_t))$. Moreover, we interventionally adjust the detached features according to the obtained coordinate mapping, $\mathcal{F}'(z_t)[\Pi_{\vartheta_i\to\rho_i^{t,k}}]$, thereby perturbing the original latent representations and transforming the features of the handle region $\vartheta_i$ to the intermediate target position $\rho_i^{t,k}$. We consider self-intervention from the perturbed features to $\mathcal{F}(z_t^k)$ and back-propagate the interventional signal to the latent code $z_t^k$ along the denoiser to update the latent features. This can be depicted by Equations (6) and (7):
$$z_t^{k+1} \longleftarrow z_t^k - \frac{\partial \mathcal{L}_{LRO}}{\partial z_t^k}, \quad (6)$$
$$\mathcal{L}_{LRO} = \Big\|\mathcal{F}(z_t^k)\ast\rho_i^{t,k} - \mathcal{F}'(z_t)\big[\Pi_{\vartheta_i\to\rho_i^{t,k}}\big]\ast\rho_i^{t,k}\Big\|_1 + R_M. \quad (7)$$
Minimizing $\mathcal{L}_{LRO}$ back-propagates self-intervention gradients to the latent code, thus progressively dragging handle regions to target positions. Once PBSI is complete, we denoise $z_{T'}$ to $z_0$ and decode it into image space. The pseudocode of our DragNeXt is provided in Algorithm 1 of the appendix.
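The PBSI update of Equations (5)-(7) can also be summarized by the PyTorch-style sketch below. It is a schematic rendering under our own naming assumptions: `extract_feats`, `warp`, and `drag` are placeholders for the UNet feature extractor $\mathcal{F}(\cdot)$, the geometric mapping $\delta(\cdot)$, and the user instruction, and the fidelity term $R_M$ for uneditable regions is omitted for brevity.

```python
import torch

def eta(t, k, T, T_prime, K):
    """Progressive weight eta_{t,k} = (K * (T - t) + k) / (K * (T - T' + 1))."""
    return (K * (T - t) + k) / (K * (T - T_prime + 1))

def pbsi_iteration(extract_feats, warp, z_t_k, z_t_orig, handle_mask, drag,
                   t, k, T, T_prime, K, lr=2e-2):
    """One PBSI update (Eqs. 5-7). `warp(x, drag, frac)` applies the geometric map,
    scaled by `frac`, to a mask or a feature map; all callables are placeholders."""
    frac = eta(t, k, T, T_prime, K)                    # how much of the full drag to apply now
    z = z_t_k.detach().requires_grad_(True)
    feats = extract_feats(z)                           # F(z_t^k), shape (C, H, W)
    target_mask = warp(handle_mask, drag, frac)        # rho_i^{t,k} (Eq. 5)
    feats_orig = extract_feats(z_t_orig).detach()      # F'(z_t): detached original features
    feats_moved = warp(feats_orig, drag, frac)         # F'(z_t)[Pi_{theta_i -> rho_i^{t,k}}]
    # Region-level self-intervention loss of Eq. (7), restricted to the intermediate region.
    loss = ((feats - feats_moved).abs() * target_mask).sum()
    loss.backward()
    return (z - lr * z.grad).detach()                  # latent update of Eq. (6)
```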
REMARK 6. →Why backward self-intervention? Our
method also adopts geometric mapping functions. However,
unlike RegionDrag and FastDrag, which directly use them
to manipulate latent code, we instead leverage them to pro-
vide interventional signal. By optimizing latent code through
back-propagated gradients from the denoiser, our approach
fully exploits the prior of pretrained diffusion models, thereby
mitigating unnatural results caused by fixed mapping func-
tions. →Difference between Equations (7) and (2). $\mathcal{L}_{LRO}$ considers region-level guidance, whereas $\mathcal{L}_m$ performs point supervision and needs to iteratively track handle point positions. →Discussion on Equation (5). We unify translation, deformation, and 3D rotation into a single mapping function, $\mathrm{Trans}(\cdot)$, as deformation and 3D rotation can be interpreted as translating partial regions of an object, assisted by priors inherently encoded in diffusion models; e.g., in Figure 7 (j), the car's 3D rotation can be achieved by translating its front to the right. We extend drag points to regional guidance to more clearly specify the regions to move, deform, or rotate.

Figure 6: Comparison between our NextBench and the previous benchmarks, DragBench and Drag100. In DragBench and Drag100 the user expectation is unspecified (rotation or translation? ambiguous), whereas each NextBench sample provides an unambiguous user expectation (e.g., "move the cat's face down") together with a handle region and an editable region.
Experiments
We first introduce our NextBench and evaluation metrics,
followed by the main results of our method and ablation stud-
ies. We provide implementation details in the supplementary
material due to the limited space of the paper’s main body.
NextBench: a Benchmark for Reliable DBIE
To better evaluate model performance on Reliable DBIE,
we propose a new benchmark, NextBench, that comprises
234 test samples with drag operations including translation,
2D/3D rotation, and deformation. Each sample is clearly an-
notated with user intentions to better assess how well model
outputs align with user expectations, as shown in Figure 6.
Why NextBench? Existing benchmarks, such as DragBench
(Shi et al. 2024) and Drag100 (Zhang et al. 2024), still con-
tain ambiguous drag instructions, e.g., as shown in Figure 6,
the raccoon’s and dog’s heads could either be translated or
rotated, and both would be considered as satisfactory results.
Figure 7: Qualitative results achieved by our DragNeXt, compared with DragDiff, GoodDrag, FreeDrag, DragNoise, DiffEditor, ClipDrag, RegionDrag, and FastDrag. User intentions: (a) please move the potted plant to the left; (b) lower the height of the desk lamp; (c) rotate the cat's head around its left cheek as the central point; (d) rotate the raccoon's head around its chin as the central point; (e) rotate the sculpture to the left around its nose; (f) drag the root of the tree and widen the road; (g) extend the bird's golden feathers; (h) drag the cup to make it taller; (i) drag the face to make it face straight ahead; (j) drag the car's front to rotate it toward the right side. The numbers indicate per-example latency.
Our NextBench explicitly annotates each sample with a clear
user expectation, handle regions, and editable areas, therefore
enabling a more reliable assessment of intention–result align-
ment and regional consistency. NextBench also treats 3D and
2D rotations as two distinct operations, showing that current
approaches struggle with 2D rotation and excel at yielding
3D rotation based on the prior of pretrained diffusion models.
Evaluation Metrics
Following prior work (Zhang et al. 2024; Shi et al. 2024), we
use LPIPS and DAI to evaluate performance on NextBench,
where the radius of DAI is set to 20. To better assess region-level DBIE, LPIPS is computed between original images and edited results over three parts: (i) LPIPS_ue for uneditable regions (a lower LPIPS_ue indicates better preservation of uneditable regions); (ii) LPIPS_th for the consistency between handle and target regions (a lower LPIPS_th means handle regions are successfully dragged to target positions); (iii) LPIPS_hh for handle regions (successfully dragging handle regions to target positions should result in a higher LPIPS_hh between the handle regions of original input images and edited results, reflecting the change of visual content).
Main Results
We compare our method with eight typical open-source DBIE
methods: DragDiffusion, DiffEditor, DragNoise, FreeDrag,
Method Lat↓ DAI↓ LPIPS_ue↓ LPIPS_th↓ LPIPS_hh↑
DragDiff 36s
DiffEditor 24s
DragNoise 34s
FreeDrag 69s
GoodDrag 51s
ClipDrag 58s
RegionDrag 3s
FastDrag 8s
DragNeXt 28s
Table 1: Quantitative results on NextBench. “Lat” indicates
mean latency per image for dragging handle point regions to
target positions. ↑ / ↓ denotes higher/lower values are better.
GoodDrag, ClipDrag, RegionDrag, and FastDrag.
Qualitative Results. We present qualitative results of our
method in Figure 7, from which we can make the follow-
ing observations. By specifying the regions and the types of geometric transformations, DragNeXt achieves better alignment between user expectations and results, e.g., as shown in Figure 7 (a), our model translates the potted plant leftward without damaging its shape. Based on LRO and PBSI, DragNeXt achieves a better trade-off between efficiency and quality, e.g., it is noticeably more efficient than methods based on alternating point motion supervision and tracking, while delivering noticeably higher editing quality than those relying solely on predefined mapping functions.
Figure 8: Voting results of the user study; 84% of the votes favored DragNeXt, with the remainder split among ClipDrag, RegionDrag, FastDrag, and "none of the results is satisfying".
Figure 9: Efficiency and quality improvements over the point-based alternating workflow (GoodDrag and ClipDrag with 2, 6, and 12 points versus our DragNeXt). User intentions: (a) move the hand holding a metallic-copper handbell; (b) rotate the cat's head around its left cheek as the central point; (c) lower the height of the desk lamp.
REMARK 7. Discussion on 2D/3D rotation. →We observe
that existing DBIE methods particularly excel at yielding 3D
rotation effects, e.g., as shown in Figure 7 (i) and (j), drag-
ging the face or the vehicle front enables most methods to
rotate them. This capability arises from the strong prior of
pretrained diffusion models, which are creative to generate
rotated objects based on latent features perturbed by drag op-
erations. Among these methods, DragNeXt leverages region-
level visual cues, which can obviously guide diffusion models
to realize better 3D rotation effects. 7→Current DBIE meth-
ods generally fail to perform 2D rotation, as these patterns
are not well captured and learned by diffusion models. Drag-
NeXt can mitigate this by explicitly using regional geometric
transformations.
Quantitative Results. The quantitative results are summa-
rized in Table 1. The table shows that our method achieves the lowest DAI and LPIPS_th and the highest LPIPS_hh, demonstrating that our method can effectively drag objects from handle regions to target positions. Failing to drag objects to target positions would instead result in a low LPIPS_hh—indicating little change in handle regions—and a high LPIPS_th due to the mismatch between original handle regions and edited target regions. Also, the value of LPIPS_ue indicates that our method preserves the high fidelity of uneditable areas.
Anonymous User Study. We further conducted user studies
to validate our method, in which 26 participants were invited. Selecting the most appropriate result from too many options is time-consuming; to reduce the volunteers' workload, we include only the three most relevant methods: ClipDrag, RegionDrag, and FastDrag. The results consistently demonstrate the superiority of DragNeXt, e.g., 84% of the votes favored our results, indicating higher quality and better alignment with user
expectations (see the supplementary material for details).
Method Analysis
Efficiency and Quality Improvements. In Figure 9, we ana-
lyze the efficiency and quality improvements of our method
over the point-based alternating workflow. Without losing
generality, we choose two recent typical works—GoodDrag
and ClipDrag—as compared baselines. The results consis-
tently validate the effectiveness of our method. Point-based methods often struggle to align with user intentions, e.g., as shown in Figure 9 (a), GoodDrag and ClipDrag drag only the hand and leave the handbell unmoved when using 2 points. Although increasing the number of points can mitigate ambiguity, it largely slows down the alternating drag process and does not always guide models to yield satisfactory results.

Figure 10: Ablation study on PBSI. User intentions: (a) move the hand holding a metallic-copper handbell; (b) rotate the cat's head around its left cheek as the central point; (c) lower the height of the desk lamp. "Full PBSI" indicates using the full PBSI strategy, "w/o inter" means the guidance from intermediate drag states is not considered in PBSI, and "PBSI: N" indicates that PBSI is performed over N timesteps. Zoom in for a better view.
Ablation Study on PBSI. We provide ablation studies for
our PBSI strategy in Figure 10. Based on the results shown in the figure, we have the following observations. Firstly, removing the guidance of intermediate states significantly degrades output quality, e.g., the hand in Figure 10 (a) is dragged to an incorrect position, and unnatural results are produced in Figure 10 (b), demonstrating its important role in achieving high-quality DBIE results. We also study the impact of performing PBSI over different numbers of timesteps. When PBSI is applied to only a single denoising timestep, objects cannot be successfully dragged to target positions. By contrast, increasing the number of timesteps over which PBSI is performed clearly improves the quality of edited results, saturating after 5 timesteps, which also indicates the effectiveness of our method in guiding diffusion models to achieve DBIE.
Conclusion
We propose to address Drag-Based Image Editing (DBIE)
from a new perspective—unifying it as a Latent Region Op-
timization (LRO) problem that aims to use region-level ge-
ometric transformations to optimize latent code to realize
drag-based manipulation. By specifying the areas and types
of geometric transformations, we can effectively reduce gaps
between users’ intentions and actual model behaviors. We
also design a new simple-yet-effective editing framework,
dubbed DragNeXt. It solves LRO through a Progressive
Backward Self-Intervention (PBSI), which simplifies the pro-
cedure of DBIE while further enhancing editing quality by
fully leveraging region-level structure information and pro-
gressive guidance from intermediate transformation states.
Physically driven DBIE remains highly challenging. There-
fore, in the future, we plan to enhance our DragNeXt by in-
tegrating physics-based geometric transformation functions.
Acknowledgments
This research is supported by the RIE2025 Industry Align-
ment Fund – Industry Collaboration Projects (IAF-ICP)
(Award I2301E0026), administered by A*STAR, as well as
supported by Alibaba Group and NTU Singapore through
Alibaba-NTU Global e-Sustainability CorpLab (ANGEL).
References
Brooks, T.; Holynski, A.; and Efros, A. A. 2023. Instruct-
pix2pix: Learning to follow image editing instructions. In
Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 18392–18402.
Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; and Zheng, Y.
2023. Masactrl: Tuning-free mutual self-attention control for
consistent image synthesis and editing. In Proceedings of
the IEEE/CVF international conference on computer vision,
22560–22570.
Choi, G.; Jeong, T.; Hong, S.; and Hwang, S. J. 2024. Drag-
Text: Rethinking Text Embedding in Point-based Image Edit-
ing. arXiv preprint arXiv:.
Chung, J.; Hyun, S.; and Heo, J.-P. 2024. Style injection
in diffusion: A training-free approach for adapting large-
scale diffusion models for style transfer. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, 8795–8805.
Cui, Y.; Zhao, X.; Zhang, G.; Cao, S.; Ma, K.; and Wang,
L. 2024. StableDrag: Stable dragging for point-based image
editing. In European Conference on Computer Vision, 340–
356. Springer.
Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat
gans on image synthesis. Advances in neural information
processing systems, 34: 8780–8794.
Giannone, G.; Nielsen, D.; and Winther, O. 2022. Few-Shot
Diffusion Models. In NeurIPS 2022 Workshop on Score-
Based Methods.
Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.;
Pritch, Y.; and Cohen-Or, D. 2022. Prompt-to-prompt im-
age editing with cross attention control. arXiv preprint
arXiv:.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion
probabilistic models. Advances in neural information pro-
cessing systems, 33: 6840–6851.
Hou, X.; Liu, B.; Zhang, Y.; Liu, J.; Liu, Y.; and You, H. 2024.
Easydrag: Efficient point-based manipulation on diffusion
models. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 8404–8413.
Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.;
Wang, L.; Chen, W.; et al. 2022. Lora: Low-rank adaptation
of large language models. ICLR, 1(2): 3.
Jiang, Z.; Wang, Z.; and Chen, L. 2025. CLIPDrag: Com-
bining Text-based and Drag-based Instructions for Image
Editing. In Proceedings of the Thirteenth International Con-
ference on Learning Representations.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochas-
tic optimization. arXiv preprint arXiv:.
Liao, B.; Wang, X.; Zhu, L.; Zhang, Q.; and Huang, C. 2025.
Vig: Linear-complexity visual sequence learning with gated
linear attention. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 39, 5182–5190.
Ling, P.; Chen, L.; Zhang, P.; Chen, H.; Jin, Y.; and Zheng,
J. 2024. Freedrag: Feature dragging for reliable point-based
image editing. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 6860–6870.
Liu, H.; Xu, C.; Yang, Y.; Zeng, L.; and He, S. 2024. Drag
your noise: Interactive point-based editing via diffusion se-
mantic propagation. In Proceedings of the IEEE/CVF confer-
ence on computer vision and pattern recognition, 6743–6752.
Lu, J.; Li, X.; and Han, K. 2024. Regiondrag: Fast region-
based image editing with diffusion models. In European
Conference on Computer Vision, 231–246. Springer.
Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; and Cohen-
Or, D. 2023. Null-text inversion for editing real images using
guided diffusion models. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, 6038–
6047.
Mou, C.; Wang, X.; Song, J.; Shan, Y.; and Zhang, J. 2023.
Dragondiffusion: Enabling drag-style manipulation on diffu-
sion models. arXiv preprint arXiv:.
Mou, C.; Wang, X.; Song, J.; Shan, Y.; and Zhang, J. 2024.
Diffeditor: Boosting accuracy and flexibility on diffusion-
based image editing. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, 8488–
8497.
Pan, X.; Tewari, A.; Leimkühler, T.; Liu, L.; Meka, A.; and
Theobalt, C. 2023. Drag your gan: Interactive point-based
manipulation on the generative image manifold. In ACM
SIGGRAPH 2023 conference proceedings, 1–11.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Om-
mer, B. 2022. High-resolution image synthesis with latent dif-
fusion models. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 10684–10695.
Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.;
and Aberman, K. 2023. Dreambooth: Fine tuning text-to-
image diffusion models for subject-driven generation. In
Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 22500–22510.
Shi, Y.; Xue, C.; Liew, J. H.; Pan, J.; Yan, H.; Zhang, W.;
Tan, V. Y.; and Bai, S. 2024. Dragdiffusion: Harnessing
diffusion models for interactive point-based image editing.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 8839–8849.
Song, J.; Meng, C.; and Ermon, S. 2020. Denoising diffusion
implicit models. arXiv preprint arXiv:.
Sun, H.; Li, W.; Liu, J.; Chen, H.; Pei, R.; Zou, X.; Yan, Y.;
and Yang, Y. 2024. Coser: Bridging image and language for
cognitive super-resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
25868–25878.
Tan, W.; Chen, S.; and Yan, B. 2023. Diffss: Diffusion
model for few-shot semantic segmentation. arXiv preprint
arXiv:.
Wu, R.; Yang, T.; Sun, L.; Zhang, Z.; Li, S.; and Zhang,
L. 2024. Seesr: Towards semantics-aware real-world image
super-resolution. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 25456–25467.
Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding condi-
tional control to text-to-image diffusion models. In Proceed-
ings of the IEEE/CVF international conference on computer
vision, 3836–3847.
Zhang, Y.; Huang, N.; Tang, F.; Huang, H.; Ma, C.; Dong, W.;
and Xu, C. 2023. Inversion-based style transfer with diffusion
models. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 10146–10156.
Zhang, Z.; Liu, H.; Chen, J.; and Xu, X. 2024. GoodDrag: To-
wards good practices for drag editing with diffusion models.
arXiv preprint arXiv:.
Zhao, K.; Shi, J.; Zhu, B.; Zhou, J.; Shen, X.; Zhou,
Y.; Sun, Q.; and Zhang, H. 2025. Real-Time Motion-
Controllable Autoregressive Video Diffusion. arXiv preprint
arXiv:.
Zhao, X.; Guan, J.; Fan, C.; Xu, D.; Lin, Y.; Pan, H.; and
Feng, P. 2024. FastDrag: Manipulate anything in one step.
arXiv preprint arXiv:.
Zhou, J.; Zhou, Y.; Zhao, K.; Xu, Q.; Zhu, B.; Hong, R.;
and Zhang, H. 2025a. Streaming Drag-Oriented Interac-
tive Video Manipulation: Drag Anything, Anytime! arXiv
preprint arXiv:.
Zhou, Y.; Chen, X.; Guo, Y.; Yu, J.; Hong, R.; and Tian, Q.
2024. Advancing incremental few-shot semantic segmenta-
tion via semantic-guided relation alignment and adaptation.
In International Conference on Multimedia Modeling, 244–
257. Springer.
Zhou, Y.; Xu, Q.; Cui, J.; Zhou, J.; Zhang, J.; Hong, R.; and
Zhang, H. 2025b. CARE Transformer: Mobile-Friendly Lin-
ear Visual Transformer via Decoupled Dual Interaction. In
Proceedings of the Computer Vision and Pattern Recognition
Conference, 20135–20145.
Zhu, L.; Huang, Z.; Liao, B.; Liew, J. H.; Yan, H.; Feng, J.;
and Wang, X. 2025. Dig: Scalable and efficient diffusion
models with gated linear attention. In Proceedings of the
Computer Vision and Pattern Recognition Conference, 7664–
7674.
A. DDIM Sampling and Inversion
In this section, we provide more details about DDIM (Song,
Meng, and Ermon 2020), which is employed in our editing
framework. DDIM defines the sampling of diffusion models
as a non-Markovian process:
$$q(z_{t-1}|z_t, z_0) = \mathcal{N}\Big(\sqrt{\alpha_{t-1}}\, z_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot \frac{z_t - \sqrt{\alpha_t}\, z_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 \mathbf{I}\Big). \quad (8)$$
Therefore, the sampling step can be formulated using Equation (9):
$$z_{t-1} = \sqrt{\alpha_{t-1}}\Big(\frac{z_t - \sqrt{1-\alpha_t}\,\varepsilon_\Theta(z_t)}{\sqrt{\alpha_t}}\Big) + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\varepsilon_\Theta(z_t) + \sigma_t\varepsilon, \quad (9)$$
where $\varepsilon\sim\mathcal{N}(0,\mathbf{I})$ represents standard Gaussian noise independent of the latent code $z_t$, and $\sigma_t = \eta\sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\,\sqrt{1-\alpha_t/\alpha_{t-1}}$ for all timesteps. When setting $\eta = 1$, Equation (9) becomes DDPM, which is equivalent to a stochastic differential equation (SDE). Setting $\eta = 0$ yields a deterministic sampling process, corresponding to an ordinary differential equation (ODE). Given the sampling process in Equation (9), DDIM inversion can be described by Equation (10):
$$z_{t+1} = \frac{\sqrt{\alpha_{t+1}}}{\sqrt{\alpha_t}}\Big(z_t - \sqrt{1-\alpha_t}\cdot\varepsilon_\Theta(z_t)\Big) + \sqrt{1-\alpha_{t+1}}\cdot\varepsilon_\Theta(z_t), \quad (10)$$
which is based on the assumption that the ODE is invertible in the limit of small step sizes.
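For reference, the deterministic sampling step (Equation (9) with $\eta = 0$) and the inversion step (Equation (10)) can be written as the short PyTorch-style sketch below; `eps_theta` and the cumulative schedule `alpha_bar` are placeholders for the pretrained denoiser and its $\alpha_t$ values.

```python
import torch

def ddim_denoise_step(eps_theta, z_t, t, t_prev, alpha_bar):
    """Deterministic DDIM sampling step (Equation (9) with eta = 0)."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_theta(z_t, t)
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # predicted clean latent
    return a_prev.sqrt() * z0_pred + (1 - a_prev).sqrt() * eps   # z_{t-1}

def ddim_inversion_step(eps_theta, z_t, t, t_next, alpha_bar):
    """DDIM inversion step (Equation (10)), valid in the small-step limit."""
    a_t, a_next = alpha_bar[t], alpha_bar[t_next]
    eps = eps_theta(z_t, t)
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps   # z_{t+1}
```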
B. Implementation Details
We implement our DragNeXt in PyTorch and, following
prior works (Zhang et al. 2024; Lu, Li, and Han 2024;
Zhao et al. 2024; Jiang, Wang, and Chen 2025), employ
Stable-Diffusion-v1-5 as the base model to ensure
fair comparison between methods. We optimize the learnable parameters using the Adam optimizer (Kingma and Ba 2014) with a learning rate of $2\times 10^{-2}$. Following (Shi et al. 2024; Zhang et al. 2024), we finetune diffusion models via LoRA (Hu et al. 2022) with a rank of 16. The number of denoising timesteps is set to $T_{\max} = 50$, and the inversion strength is fixed such that each input image is inverted to the timestep $T = 38$. Also, $T'$ and $K$ are set to 33 and 10, respectively. Following (Shi et al. 2024; Zhang et al. 2024; Lu, Li, and Han 2024), we incorporate mutual self-attention (Cao et al. 2023) starting from the 10-th layer of the UNet.
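Putting these hyperparameters together, the overall editing schedule can be outlined as follows. This is only an illustrative sketch under our own naming assumptions: `invert`, `denoise_step`, and `pbsi_update` are placeholders for DDIM inversion, a single DDIM denoising step, and one PBSI latent update, and the actual implementation additionally handles LoRA finetuning, feature extraction, and mutual self-attention.

```python
def drag_next_schedule(invert, denoise_step, pbsi_update, z0, T=38, T_prime=33, K=10):
    """High-level sketch: DDIM-invert the input latent to timestep T, run K PBSI
    iterations per timestep while T' < t <= T, then denoise the rest of the way."""
    z_t = invert(z0, T)                      # z_T obtained by DDIM inversion
    for t in range(T, 0, -1):
        if t > T_prime:                      # PBSI window
            for k in range(K):
                z_t = pbsi_update(z_t, t=t, k=k)
        z_t = denoise_step(z_t, t)           # one DDIM step: z_t -> z_{t-1}
    return z_t                               # edited z_0, decoded by the VAE afterwards
```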
C. NextBench: a Benchmark for Reliable DBIE
To better assess model performance on Reliable DBIE, we
propose a new benchmark, NextBench, comprising 234 care-
fully annotated samples with detailed drag instructions and
corresponding user intentions. As illustrated in Figure 11 (a),
each drag instruction is specified by six key components:
handle regions, editable regions, center points, handle points,
target points, and transformation types. Annotators explic-
itly record their intentions for each sample, enabling a more
faithful evaluation of how well generated results align with
user expectations.
NextBench is the first benchmark to explicitly incorporate
constraints on both the type and region of geometric trans-
formations for dragging, serving as a critical step toward
realizing Reliable DBIE. To streamline data collection, we
developed a user-friendly web-based system, following the
pipeline illustrated in Figure 11 (b), which will be publicly
released soon. NextBench offers diverse content, including
200 real images and 34 AI-generated images, covering 103
animal images, 18 artistic paintings, 32 landscapes, 24 plant
images, 31 human portraits, and 26 everyday objects such as
furniture and vehicles. As a high-quality benchmark is essen-
tial for driving progress in this field, we are committed to the
continuous maintenance and improvement of NextBench.
Evaluation metrics. Following prior work (Zhang et al.
2024; Shi et al. 2024), we use LPIPS and DAI to evaluate
performance on NextBench. DAI aims to measure distance
between patches centered at handle points and those centered
at corresponding target points, which is first introduced by
GoodDrag (Zhang et al. 2024):
$$\mathrm{DAI} = \frac{1}{n}\sum_{i=1}^{n} \frac{\big\|\phi(z_0)_{\pi(h_i;\gamma)} - \phi(\hat{z}_0)_{\pi(g_i;\gamma)}\big\|_2^2}{(1+2\gamma)^2}, \quad (12)$$
where $\phi$ denotes the VAE decoder that maps $z_0$ to the RGB image space, and $\pi(h_i;\gamma)$ represents a patch centered at $h_i$ with a radius $\gamma$. Equation (12) computes the mean squared error between the patch at the handle point $h_i$ of $\phi(z_0)$ and the corresponding patch at the target point $g_i$ of $\phi(\hat{z}_0)$.
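A direct implementation of Equation (12) could look like the PyTorch sketch below; handling image borders with replicate padding is our own assumption, since the exact boundary behavior is not specified.

```python
import torch
import torch.nn.functional as F

def dai(img_src, img_edit, handles, targets, gamma=20):
    """Drag Accuracy Index of Eq. (12). Images are (3, H, W) tensors; handle/target
    points are (x, y) integer pairs; gamma is the patch radius (20 in our evaluation)."""
    def patch(img, point):
        x, y = point
        padded = F.pad(img[None], (gamma, gamma, gamma, gamma), mode="replicate")[0]
        return padded[:, y:y + 2 * gamma + 1, x:x + 2 * gamma + 1]   # (3, 2g+1, 2g+1)
    total = 0.0
    for h, g in zip(handles, targets):
        diff = patch(img_src, h) - patch(img_edit, g)
        total += (diff ** 2).sum().item() / (1 + 2 * gamma) ** 2     # normalized squared error
    return total / len(handles)
```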
To better assess region-level DBIE, LPIPS is computed between original images and edited results over three parts: (i) LPIPS_ue for uneditable regions, (ii) LPIPS_th for the consistency between handle and target regions, and (iii) LPIPS_hh for handle regions, which are shown in Equations (13), (14), and (15), respectively:
$$\mathrm{LPIPS}_{ue} = \frac{1}{m}\sum_{j=1}^{m} \mathrm{LPIPS}\big(x_j[M],\ \bar{x}_j[M]\big) \quad (13)$$
$$\mathrm{LPIPS}_{th} = \frac{1}{mn}\sum_{j=1}^{m}\sum_{i=1}^{n} \mathrm{LPIPS}\big(x_j[\vartheta_i],\ \bar{x}_j[\rho_i]\big) \quad (14)$$
$$\mathrm{LPIPS}_{hh} = \frac{1}{mn}\sum_{j=1}^{m}\sum_{i=1}^{n} \mathrm{LPIPS}\big(x_j[\vartheta_i],\ \bar{x}_j[\vartheta_i]\big). \quad (15)$$
In the above equations, LPIPS(·) measures LPIPS values
between input images, and [·] selects regions where given
binary masks have a value of 1. xj and x̄j represent a pair of
an original image and an edited result, {ϑi}i=1,...,n represent
handle regions given by users, and {ρi}i=1,...,n denote target
regions in edited results. Target regions can be estimated by
considering drag instructions given by users, as mentioned in
Definition 2 and Equation (5) of the paper’s main body.
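For illustration, a masked-LPIPS helper built on the publicly available lpips package could be sketched as follows. Zeroing out pixels outside each region is one plausible reading of the $[\cdot]$ selection operator; cropping to the region's bounding box would be an alternative, and the averaging over samples and regions of Equations (13)-(15) is performed outside the helper.

```python
import lpips          # pip install lpips
import torch

loss_fn = lpips.LPIPS(net="alex")     # expects (N, 3, H, W) images scaled to [-1, 1]

def masked_lpips(x, x_edit, mask_src, mask_edit):
    """LPIPS between region `mask_src` of x and region `mask_edit` of x_edit,
    with pixels outside each region zeroed before comparison."""
    a = x * mask_src[None].float()
    b = x_edit * mask_edit[None].float()
    with torch.no_grad():
        return loss_fn(a[None], b[None]).item()

# LPIPS_ue: masked_lpips(x, x_edit, M, M), averaged over images            (Eq. 13)
# LPIPS_th: masked_lpips(x, x_edit, theta_i, rho_i), averaged over i and j (Eq. 14)
# LPIPS_hh: masked_lpips(x, x_edit, theta_i, theta_i), averaged over i, j  (Eq. 15)
```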
What can these metrics do? According to Equation (13),
(14), (15), we summarize the functions of these metrics here:
• LPIPS_ue measures LPIPS between the uneditable regions of an original input image and an edited result. A lower LPIPS_ue indicates better preservation of uneditable regions, whereas a higher LPIPS_ue implies that uneditable regions are altered during dragging.
• LPIPS_th measures the consistency between the handle regions of an original input image and the target regions of an edited result. A lower LPIPS_th means handle regions are successfully dragged to target positions; conversely, failing to drag objects to target positions results in a higher LPIPS_th due to the mismatch between original handle regions and target areas in edited results.
• LPIPS_hh measures the divergence between the handle regions of an input image and an edited result. Successfully dragging handle regions to target positions should result in a higher LPIPS_hh, reflecting the change of visual content; otherwise, the handle regions in the original and edited images remain the same, resulting in a lower LPIPS_hh.
Why not point-based metrics? We do not employ point-
based evaluation metrics in NextBench, such as the Mean
Distance (MD) between handle and target points (Shi et al.
2024), as they are incompatible with the region-based na-
ture of our proposed Reliable DBIE. Unlike the previous
point-based DBIE setting, Reliable DBIE emphasizes region-
level consistency, rendering point-based metrics biased and
insufficient for evaluating model performance in this context.
D. Translation, Deformation, and 2D/3D
Rotation
In this work, we adopt two geometric transformations widely
used in computer graphics: translation and rotation. We
observe that these two transformation functions can cover
most DBIE scenarios, including translation, deformation, and
2D/3D rotation. Theoretically, our approach does not have
restrictions on the types of geometric transformations
and is compatible with other transformations used in
computer graphics. This paper aims to provide a new foun-
dational framework for DBIE, and we leave the exploration
of more transformations within this framework to our future
research.
Translation
Translation refers to moving an object or region from one lo-
cation to another without altering its shape or size. Suppose a
point p = (x, y) is translated along a direction d = (dx, dy);
its new coordinates p′ = (x′, y′) can be computed using
Equation (16):
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & d_x \\ 0 & 1 & d_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Translation matrix}} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (16)$$
where the first term on the right-hand side of the equation is
commonly referred to as the translation matrix. Region-level
translation is achieved by applying this translation to every
point within the region. According to the handle point $h_i$ and the target point $g_i$, we can calculate the translation offset $d_i$ of the region $\vartheta_i$ as $d_i = g_i - h_i = (x_i^g - x_i^h,\ y_i^g - y_i^h)$.
Deformation
Deformation refers to a non-rigid transformation that alters
the shape of an object. As shown in Figure 12 (a), previous DBIE methods achieve object deformation by dragging handle points to corresponding target positions. In contrast, our DragNeXt extends this paradigm from point-based to region-oriented deformation by reformulating object deformation as the translation of a local part of an object, e.g., as shown in Figure 12 (b), dragging the edge region of the lamp to realize its deformation. In the following, we provide more discussion and analysis of our region-oriented deformation.

Figure 11: A brief illustration of samples from our NextBench. (a) Examples from NextBench: each sample provides an image, a handle region, an editable region, and the actions recorded on our web interface (center points, handle points, target points, the transformation type, and the user intention, e.g., "rotate the sculpture." or "please move the potted plant to the left."). (b) Data collection through our web-based system.

Figure 12: Point-Based VS. Region-Oriented Deformation. (a) Point-based deformation: Result-1 or Result-2? (b) Region-oriented deformation: Result-1! Upgrading point-based deformation to region-oriented deformation.

Figure 13: Object Translation VS. Deformation. Translation is moving the entire region of an object, whereas deformation can be seen as translating an object's subregion.
Figure 14: Visualization of gradients back-propagated to the latent code when dragging an object or its local area. (a) Translation-1, (b) Deformation-1, (c) Translation-2, (d) Deformation-2.
Point-Based VS. Region-Oriented Deformation. Using
points alone to instruct the dragging process has a major
drawback: handle points can only offer limited contextual
information and cannot accurately specify which regions
should be dragged or deformed. For example, the drag
points in Figure 12 (a) can be interpreted as either drag-
ging the whole lamp or the lamp’s edge, introducing severe
ambiguity into the dragging process. By contrast, extend-
ing point-based instructions to region-level guidance enables
full exploitation of the information of pixels around handle
points and can more clearly specify which regions need to be
dragged. We also observe that incorporating region-level vi-
sual contexts leads to higher-quality deformation results, e.g.,
Result-1 in Figure 12 (b) is obviously better than Result-1 in
Figure 12 (a). These observations validate the importance of
extending point-based deformation to region-based deforma-
tion!
Region-Oriented Deformation VS. Translation. As can be
seen from Figure 13, object translation can be regarded as
moving the entire region of an object, whereas deformation
can be seen as moving an object's subregion. In this work,
we extend the previous point-based deformation to region-
oriented deformation, enabling more reliable DBIE by lever-
aging region-level visual context. However, similar to prior
methods (Zhang et al. 2024; Jiang, Wang, and Chen 2025;
Mou et al. 2023; Lu, Li, and Han 2024; Zhao et al. 2024),
this approach still cannot achieve physics-driven deformation
results. Currently, achieving physics-driven DBIE results re-
mains highly challenging. We leave this to future work, in which we plan to incorporate physics-based transformation functions into our editing framework.
REMARK 8. In Figure 14, we provide a visualization of
gradients back-propagated to latent code when dragging an
object or its subregion. In Figure 14 (a) and (b), the object
and its local region are dragged over a short distance. The
gradients are primarily localized in the regions that require
manipulation, whereas the areas that do not need adjustment
remain unaffected. In Figure 14 (c) and (d), the desk lamp
and its local area are dragged over a longer distance. We
observe that, regardless of the dragging distance, regions
with distinct appearances remain separated from the target
regions. For instance, the black desk lamp does not disturb
the appearance of the white background in Figure 14 (c); in
Figure 14 (d), the background does not affect the extended
region of the desk lamp caused by the dragging operation.
This phenomenon can be attributed to strong prior knowledge
and patterns learned by pretrained diffusion models from vast
amounts of training data.
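The locality described in this remark can be sanity-checked in a few lines of PyTorch. The following toy sketch is our own illustration rather than the paper's implementation: a small random convolution stands in for the feature extractor F, a two-pixel shift stands in for the drag mapping, and the regularization term is omitted. It confirms that the gradient of a region-masked L1 loss reaches the latent only inside the masked area, plus the receptive field of the stand-in features.

```python
import torch

# Toy check of gradient locality: the loss only compares features inside a
# drag-region mask, so the gradient flowing back to the latent is confined to
# that region (expanded by the 1-pixel halo of the 3x3 stand-in convolution).
torch.manual_seed(0)
feat = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)   # stand-in for UNet features

z = torch.randn(1, 4, 32, 32, requires_grad=True)        # latent code
ref = feat(z).detach()                                    # reference features (fixed)
rho = torch.zeros(1, 1, 32, 32)
rho[..., 10:16, 10:16] = 1.0                              # drag-region mask

shifted_ref = torch.roll(ref, shifts=2, dims=-1)          # features "dragged" 2 px right
loss = ((feat(z) - shifted_ref) * rho).abs().sum()        # region-masked L1 loss
loss.backward()

grad_mag = z.grad.abs().sum(dim=1)                        # per-pixel gradient magnitude
nonzero = (grad_mag > 0).nonzero()[:, 1:]                 # (row, col) of nonzero gradients
print(nonzero.min(dim=0).values, nonzero.max(dim=0).values)  # mask extent 10..15 plus halo
```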
2D/3D Rotation
Rotation refers to the process of rotating an object or region
around a specified point by a certain angle. The rotation
operation can be categorized into two types, 2D rotation and
3D rotation, which are introduced one by one below.
2D Rotation. Suppose we aim to rotate a region around a center point $c = (x_c, y_c)$ by an angle $\theta$. For each point $p = (x, y)$ in this region, the updated coordinates $p' = (x', y')$ can be computed using Equation (17):
\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} 1 & 0 & x_c \\ 0 & 1 & y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Back to } c}
\underbrace{\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Rotation matrix}}
\underbrace{\begin{bmatrix} 1 & 0 & -x_c \\ 0 & 1 & -y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{To the origin}}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix},
\tag{17}
\]
where the middle term on the right-hand side of the equation
is commonly referred to as the rotation matrix, while the
remaining matrices are used to translate regions either to the
origin or back to the center point $c$. According to the handle point $h_i = (x_i^h, y_i^h)$, target point $g_i = (x_i^g, y_i^g)$, and center point $c_i = (x_i^c, y_i^c)$ given by users, we can calculate the rotation matrix as follows:
\[
\cos\theta = \frac{(x_i^h - x_i^c)(x_i^g - x_i^c) + (y_i^h - y_i^c)(y_i^g - y_i^c)}
{\sqrt{(x_i^h - x_i^c)^2 + (y_i^h - y_i^c)^2}\,\sqrt{(x_i^g - x_i^c)^2 + (y_i^g - y_i^c)^2}},
\tag{18}
\]
\[
\sin\theta = \frac{(x_i^h - x_i^c)(y_i^g - y_i^c) - (y_i^h - y_i^c)(x_i^g - x_i^c)}
{\sqrt{(x_i^h - x_i^c)^2 + (y_i^h - y_i^c)^2}\,\sqrt{(x_i^g - x_i^c)^2 + (y_i^g - y_i^c)^2}}.
\tag{19}
\]
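To make Equations (17), (18), and (19) concrete, here is a small NumPy sketch written by us for illustration; function names such as rotation_from_drag are hypothetical and not part of any released code. It recovers $(\cos\theta, \sin\theta)$ from a handle point, its target position, and the user-chosen center, builds the homogeneous transform of Equation (17), and applies it to region coordinates.

```python
import numpy as np

def rotation_from_drag(handle, target, center):
    """Recover (cos theta, sin theta) of Eqs. (18)-(19) from a handle point,
    its target position, and the user-chosen rotation center."""
    h = np.asarray(handle, dtype=float) - np.asarray(center, dtype=float)
    g = np.asarray(target, dtype=float) - np.asarray(center, dtype=float)
    norm = np.linalg.norm(h) * np.linalg.norm(g)          # assumes handle/target != center
    cos_t = (h[0] * g[0] + h[1] * g[1]) / norm             # dot product -> cos theta
    sin_t = (h[0] * g[1] - h[1] * g[0]) / norm             # 2D cross product -> sin theta
    return cos_t, sin_t

def rotate_region(points, center, cos_t, sin_t):
    """Apply the homogeneous transform of Eq. (17): translate the region to the
    origin, rotate by theta, then translate back to the center c."""
    xc, yc = center
    T_back = np.array([[1, 0, xc], [0, 1, yc], [0, 0, 1]], dtype=float)
    R      = np.array([[cos_t, -sin_t, 0], [sin_t, cos_t, 0], [0, 0, 1]], dtype=float)
    T_to_o = np.array([[1, 0, -xc], [0, 1, -yc], [0, 0, 1]], dtype=float)
    M = T_back @ R @ T_to_o
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # (N, 3) homogeneous coords
    return (pts_h @ M.T)[:, :2]

# Toy usage: dragging handle (1, 0) to target (0, 1) around center (0, 0)
# corresponds to a 90-degree counter-clockwise rotation of the region.
cos_t, sin_t = rotation_from_drag(handle=(1, 0), target=(0, 1), center=(0, 0))
new_pts = rotate_region(np.array([[1.0, 0.0], [2.0, 0.0]]), (0, 0), cos_t, sin_t)
```

Note that this sketch assumes the usual mathematical convention with the y-axis pointing up; on image grids where y points down, the same formulas simply flip the apparent rotation direction.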
3D Rotation. We compare 2D and 3D rotations in Figure 15.
The 2D rotation operation can be explicitly modeled by using
a geometric transformation function. However, there is no predefined mapping function capable of handling 3D rotation well, since it involves a complex non-rigid transformation
Table 2: Voting results of our anonymous user study. DragNeXt: 84%; ClipDrag: 3%; RegionDrag: 4%; FastDrag: 5%; none of the results is satisfying: 4%.
that inherently depends on strong priors—such as object
appearance, shape, and structural consistency—during the
dragging process. For example, as shown in the last column of
Figure 15, the warpage function, used in FastDrag (Zhao et al.
2024), easily leads to unnatural and unrealistic deterioration
of objects.
Observation 1. We have observed an interesting phe-
nomenon in our experiments: despite 3D rotation being in-
herently more challenging than 2D rotation, current DBIE
models surprisingly perform better on 3D rotation than on
2D rotation!
As shown in the third row of Figure 15, most DBIE methods
are able to rotate the 3D angle of the sculpture by directly
dragging the face leftward; by contrast, none of the
methods can achieve 2D rotation of the face of the sculpture
or the raccoon, as exhibited in the first and second rows.
We attribute this counterintuitive phenomenon to two factors: (i) the strong capability of current DBIE methods in 3D rotation actually arises from the strong prior of pretrained diffusion models, which can creatively generate rotated objects based on latent features perturbed by drag operations; (ii) although 2D rotation can be explicitly modeled by geometric transformation functions, the 2D rotation pattern is not well captured and learned by diffusion models during pretraining.
Point-Based Rotation VS. Region-Based Rotation. The
concept of 3D rotation in DBIE was formally introduced for
the first time in GoodDrag (Zhang et al. 2024). As shown
in Figure 16 (a), GoodDrag uses handle and target points to
instruct the dragging process of 3D rotation, i.e., dragging
handle points to the positions of target points. As shown in
Figure 16 (b), we also extend the point-based 3D rotation to
the region-based 3D rotation. Extending point-based instruc-
tions to region-level guidance enables full exploitation of
the information of pixels around handle points and can more
clearly specify which regions need to be dragged, thereby
helping models to achieve better 3D rotation effects.
REMARK 9. (i) Relationship between point-based 3D rotation and translation. We argue that there is no essential difference between point-based 3D rotation and translation, both of which aim to drag handle points to the positions of target points. However, translating the handle points of an object may lead
[Figure 15 grid: rows correspond to 2D rotation and 3D rotation; columns compare Image & Drag with DragNeXt, DragDiffusion, DiffEditor, FreeDrag, DragNoise, GoodDrag, ClipDrag, RegionDrag, and FastDrag.]
Figure 15: Illustrations of 2D Rotation and 3D Rotation.
(a) Point-based 3D rotation (b) Region-based 3D rotation
Figure 16: Point-Based VS. Region-Based 3D Rotation.
Point-based 3D rotation is introduced by GoodDrag (Zhang
et al. 2024), and region-based 3D rotation is used in our
DragNeXt. Panel (a) is borrowed from GoodDrag.
diffusion models to draw 3D rotation effects, e.g., the car's 3D rotation can be realized by translating its front to the right, as shown in the instruction in Figure 16. (ii) Relationship between region-based 3D rotation and translation. Similar to the point-based rotation used in GoodDrag, region-based 3D rotation can be seen as translating an object's subregion.
E. More Experimental Results
Drag-Based Image Editing at Relatively Long
Distances
Dragging objects at a relatively long distance remains a major
challenge in the current field of DBIE. Most of the existing
methods only support dragging objects or regions over a short
distance and are incapable of handling long-distance drag-
based editing tasks. Although our method is not specifically
designed for long-distance DBIE, we are surprised that it
still exhibits superior performance compared to the recent
counterparts. As exemplified in Figure 17, we successfully
drag the desk lamp, the stone, and the person’s hand over a
relative long distance while maintaining high editing qual-
ity. In contrast, the compared methods either fail to achieve
long-distance dragging or to yield satisfactory quality. For
instance, FastDrag easily causes unnatural deformation of
objects, while RegionDrag is prone to resulting in artifacts
in edited regions, as we mentioned in the main body of the
paper. Also, ClipDrag, DiffEditor, and GoodDrag suffer from
severe loss of regional details during long-distance dragging.
Why is DragNeXt superior? We believe that the superiority
of our method in long-distance dragging tasks lies in two
aspects: (i) DragNeXt fully exploits region-level contextual
information, effectively alleviating the influence of losing
sparse handle points during long-distance dragging; and (ii)
it incorporates progressive guidance from intermediate drag
states, enabling a smoother and more stable dragging process.
The strength in handling long-distance dragging tasks re-
veals that our method has great potential for achieving DBIE
in complex scenarios. We plan to explore this point in our
subsequent work.
More Visualized Results
We provide more visualized results obtained by DragNeXt in
Figures 18, 19, and 20 as a supplement to the experiments
provided in the main body of the paper. These results lead to
conclusions consistent with those discussed in Section , fur-
ther confirming the effectiveness of our approach in aligning
with user intentions. For instance, in Figure 18(d), our method
successfully moves the potted plant to the left, whereas all
compared methods fail: DragDiffusion does not alter the
plant’s position, and DiffEditor, GoodDrag, ClipDrag, Re-
gionDrag, and FastDrag produce noticeable unnatural defor-
mations. Moreover, our method also demonstrates superior
performance in 2D rotation tasks, yielding more natural and
visually consistent results than the compared counterparts,
nearly all of which fail to generate satisfactory outcomes.
We have also summarized our results in videos to provide a
clearer and more intuitive illustration of the dragging effects
yielded by our DragNeXt; for details, please kindly refer to
[Figure 17 columns: Image & Drag, Our DragNeXt, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag.]
Figure 17: Experimental results on Drag-Based Image Editing at Relatively Long Distances.
the uploaded files in the supplementary material.
F. Anonymous User Study
Since quantitative evaluation metrics may not fully demon-
strate the effectiveness of our method in addressing the ambi-
guity issue and expectation-result misalignment, we addition-
ally provide an anonymous user study in which a total of 26 participants are invited. The details of the questionnaire are summarized in Figures 21, 22, 23, and 24. The questionnaire consists of 15 questions in total, where 12 items are closely related to the ambiguity issues mentioned in Proposition 1 and 3 items are used to assess the quality of edited images. Also, for each question, five candidate options are
provided:
• the options A–D correspond to randomly ordered re-
sults generated by DragNeXt, ClipDrag, RegionDrag,
and FastDrag;
• the option E indicates that none of the results are satisfac-
tory.
The reason for limiting the options to A–E is to reduce the
participants’ workload, as selecting the most suitable result
from too many options would be time-consuming and not
user-friendly. In Figure 2, we provide the anonymous voting
results from the invited participants. As can be seen from
the figure, the voting results demonstrate the effectiveness of
method again, ., the average results from the participants
indicate that 84% of our edited images are better than those
of the compared models.
Why are ClipDrag, RegionDrag, and FastDrag Chosen
as Compared Models? These three methods are the most
relevant to our research. ClipDrag (Jiang, Wang, and Chen
2025) addresses ambiguity in DBIE by incorporating tex-
tual guidance; in contrast, we reformulate DBIE as a Latent
Region Optimization (LRO) problem to alleviate ambigu-
ity while further improving the efficiency of the alternating
workflow. DragNeXt advances RegionDrag and FastDrag by
transforming forward, optimization-free manipulation into
backward, self-interventional latent optimization, thereby
fully leveraging the prior of pretrained diffusion models to
avoid unrealistic and unnatural deterioration.
G. Limitations and Future Work
We summarize the pseudocode of our method in Algorithm 1.
Here, we elaborate on the limitations of our current work,
and introduce the corresponding plan for our future research.
• Limited types of used geometric transformation func-
tions. In our current work, we adopt two geometric
transformations, namely translation and 2D rotation. Although these two transformations can realize translation, deformation, and 2D/3D rotation effects in DBIE, there may still exist other useful transformation functions that have not yet been considered, such as scaling and shearing. In the future, we will explore more types of geometric transformation functions. The main challenge of incorporating more geo-
metric transformations lies in how to unify them into the
current format of drag instructions. For example, dragging
may lead not only to regular scaling but also to irregular or
non-uniform scaling effects; however, properly defining
scaling factors along each direction remains a non-trivial
problem.
• Physically-driven drag effects. Achieving physics-
driven editing results remains a key challenging problem
in DBIE. In this work, we extend point-guided dragging
Algorithm 1: Pseudocode of our proposed method.
Input: an input image $x$, user-specified handle regions $E = \{\vartheta_i\}_{i=1,\dots,n}$ and drag instructions $C = \{T_i, O_i\}_{i=1,\dots,n}$, hyperparameters $T$, $T'$, and $K$;
$z_0 = \mathrm{VAE\_Encoder}(x)$, $\{z_1, \dots, z_T\} = \mathrm{Inversion}(z_0)$;  // Encoding and inversion.
// The denoising phase begins.
for $t = T$ to $0$ do
    if $T' < t < T$ then
        $z_t^0 \leftarrow z_t$;
        // Performing the PBSI strategy.
        for $k = 0$ to $K - 1$ do
            $\{\rho_i^{t,k}, \Pi_{\vartheta_i \to \rho_i^{t,k}}\}_{i=1,\dots,n} = \bigcup_{i=1,\dots,n} \delta(\vartheta_i, T_i, O_i, t, k)$;
            $L_{\mathrm{LRO}} = \sum_{i=1,\dots,n} \big\| F(z_t^k) * \rho_i^{t,k} - F(z_t).\mathrm{detach}()[\Pi_{\vartheta_i \to \rho_i^{t,k}}] * \rho_i^{t,k} \big\|_1 + R_M$;
            $z_t^{k+1} \leftarrow z_t^k - \frac{\partial L_{\mathrm{LRO}}}{\partial z_t^k}$;
        end
        $z_{t-1} = z_t^{K-1} - \varepsilon_\Theta(z_t^{K-1}; t, c)$;
    else
        $z_{t-1} = z_t - \varepsilon_\Theta(z_t; t, c)$;  // Vanilla denoising.
    end
end
$\bar{x} = \mathrm{VAE\_Decoder}(z_0)$;  // Decoding latent embeddings.
Output: an edited image $\bar{x}$;
to a region-oriented geometric transformation paradigm
to alleviate ambiguity and better leverage contextual infor-
mation. In future work, we aim to realize physics-driven
editing by incorporating physically based transformation
functions.
• Long-distance DBIE. Our DragNeXt is not specifically
designed for long-distance DBIE, despite achieving su-
perior results in this challenging setting. In future work,
we plan to conduct more experiments to identify the key
factors for achieving high-quality results in long-distance
DBIE and to further enhance our DragNeXt.
• DBIE with Linear Attention. Diffusion models usually
adopt standard self-attention with quadratic complexity,
resulting in high computational costs. Recent advances in
linear attention (Zhu et al. 2025; Zhou et al. 2025b; Liao
et al. 2025) offer an effective way to alleviate this compu-
tational bottleneck. In future work, we plan to investigate
incorporating linear attention into DBIE to further accel-
erate the editing process.
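As a complement to Algorithm 1, the snippet below gives a minimal PyTorch-style sketch of one inner PBSI update at a single timestep. It is our own simplified illustration rather than the released implementation: a tiny convolution stands in for the UNet feature extractor F, a pixel shift via torch.roll stands in for the mapping Π produced by δ, an explicit step size is introduced, and the regularization term R_M is omitted.

```python
import torch

# Stand-in feature extractor; in the paper F(.) denotes intermediate UNet
# features, which we replace with a fixed random projection for this sketch.
torch.manual_seed(0)
proj = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1)

def features(z):
    return proj(z)

def pbsi_step(z_t, ref_feat, masks, warps, step_size=0.1):
    """One inner PBSI update (inner loop of Algorithm 1): pull the features
    inside each intermediate target region toward the detached reference
    features warped from the handle region, then step the latent z_t."""
    z_t = z_t.detach().requires_grad_(True)
    feat = features(z_t)                                   # F(z_t^k)
    loss = 0.0
    for rho, warp in zip(masks, warps):
        target = warp(ref_feat)                            # reference features under Pi_i
        loss = loss + ((feat - target) * rho).abs().sum()  # masked L1 term of L_LRO
    grad = torch.autograd.grad(loss, z_t)[0]               # dL_LRO / dz_t^k
    return z_t - step_size * grad                          # z_t^{k+1}

# Toy usage: drag a 4x4 patch of a 16x16 latent two pixels to the right.
z = torch.randn(1, 4, 16, 16)
ref = features(z).detach()                                 # fixed reference F(z_t)
rho = torch.zeros(1, 1, 16, 16)
rho[..., 6:10, 8:12] = 1.0                                 # intermediate target region
warp = lambda f: torch.roll(f, shifts=2, dims=-1)          # Pi: shift right by 2 px
z_next = pbsi_step(z, ref, [rho], [warp])
```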
[Figure 18 grid: columns Image & Drag, Our DragNeXt, DragDiffusion, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag; numbers give per-example latency. User intentions (a)–(m): (a) move the cup to the left without changing its shape; (b) move down the whole handle of the coffee cup; (c) move the coffee cup to the top corner; (d) move the potted plant to the left; (e) drag the small cactus upward to make it taller; (f) pull the top of the cactus downward, shortening its height; (g) move the head of the woman to the left; (h) move the woman's eyes upward; (i) move the hand holding a metallic-copper handbell; (j) move the face of the man to the left; (k) move the head of the man to the left; (l) lower the height of the desk lamp; (m) make the size of the desk lamp bigger.]
Figure 18: More experimental results—part I.
[Figure 19 grid: columns Image & Drag, Our DragNeXt, DragDiffusion, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag; numbers give per-example latency. User intentions (a)–(m): (a) rotate the cat's head around its left cheek as the central point; (b) rotate the cat's head to the left around its chin as the central point; (c) move the head of the cat to the left; (d) pull the top of the hat upward; (e) press the top of the … downward; (f) move the hat downward without changing its shape; (g) drag the root of the tree and widen the road; (h) make the tree hole smaller; (i) drag the tree hole to the right; (j) pull the bear's head up and rotate its hand; (k) move the bear to the right; (l) drag the sun upward; (m) expand the visible horizon line.]
Figure 19: More experimental results—part II.
[Figure 20 grid: columns Image & Drag, Our DragNeXt, DragDiffusion, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag; numbers give per-example latency. User intentions (a)–(m): (a) rotate the sculpture to the left around its nose; (b) move the sculpture to the left without changing its pose; (c) move the eyes, nose, and mouth of the sculpture to the right; (d) rotate the head of the boy to the left around his mouth; (e) move the head of the boy to the right corner; (f) rotate the head of the wolf to the right; (g) expand the borders of the artwork; (h) drag the yellow flowers upward; (i) move the woman's head to the left on the canvas; (j) rotate the head of the alpaca; (k) rotate the head of the alpaca; (l) rotate the raccoon's head around its chin as the central point; (m) move the giraffe's head and extend its outline.]
Figure 20: More experimental results—part III.
Anonymous User Study on Drag-Based Image Editing
*Requirement: Please choose the result that best follows the editing instruction and exhibits the highest editing quality.
(1) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Rotate the sculpture to the left around its nose.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(2) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Rotate the head of the boy to the left around his mouth.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(3) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the woman's head to the left on the canvas.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(4) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the sculpture to the left and do not change its pose.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 21: Questionnaire—Part I (questions (1)∼(4)).
(5) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the head of the boy to the right corner.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(6) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the cup to the left and do not change its shape.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(7) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the potted plant to the left.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(8) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Lower the height of the desk lamp.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 22: Questionnaire—Part II (questions (5)∼(8)).
(9) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Rotate the cat's head around its left cheek as the central point.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(10) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the hat down.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(11) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the head of the woman to the left.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(12) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the hand holding a metallic-copper handbell.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 23: Questionnaire—Part III (questions (9)∼(12)).
(13) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move down the whole handle of the coffee cup.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(14) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Pull the top of the cactus downward, shortening its height.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(15) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Drag the root of the tree and widen the road.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 24: Questionnaire—Part IV (questions (13)∼(15)).