DragNeXt: Rethinking Drag-Based Image Editing
Yuan Zhou1, Junbao Zhou1, Qingshan Xu1*, Kesen Zhao1, Yuxuan Wang1, Hao Fei2,
Richang Hong3, Hanwang Zhang1
1Nanyang Technological University
2National University of Singapore
3Hefei University of Technology
Abstract
Drag-Based Image Editing (DBIE), which allows users to ma-
nipulate images by directly dragging objects within them, has
recently attracted much attention from the community. How-
ever, it faces two key challenges: (i) point-based drag is often
highly ambiguous and difficult to align with user intentions;
(ii) current DBIE methods primarily rely on alternating be-
tween motion supervision and point tracking, which is not
only cumbersome but also fails to produce high-quality re-
sults. These limitations motivate us to explore DBIE from a
new perspective—unifying it as a Latent Region Optimiza-
tion (LRO) problem that aims to use region-level geometric
transformations to optimize latent code to realize drag manip-
ulation. Thus, by specifying the areas and types of geomet-
ric transformations, we can effectively address the ambiguity
issue. We also propose a simple yet effective editing frame-
work, dubbed DragNeXt. It solves LRO through Progressive
Backward Self-Intervention (PBSI), simplifying the overall
procedure of the alternating workflow while further enhancing
quality by fully leveraging region-level structure information
and progressive guidance from intermediate drag states. We
validate DragNeXt on our NextBench, and extensive experi-
ments demonstrate that our proposed method can significantly
outperform existing approaches.
Code —
Extended version —
Introduction
Diffusion models (Rombach et al. 2022; Dhariwal and Nichol
2021) have made remarkable progress in the field of text-to-
image generation, serving as foundational models for a wide
range of generative tasks, such as image super-resolution
(Wu et al. 2024; Sun et al. 2024), style transfer (Zhang et al.
2023; Chung, Hyun, and Heo 2024), text-based image edit-
ing (Brooks, Holynski, and Efros 2023; Hertz et al. 2022),
and few-shot learning (Zhou et al. 2024; Giannone, Nielsen, and
Winther 2022; Tan, Chen, and Yan 2023). Nevertheless, an
inherent limitation of diffusion models lies in their poor con-
trollability, which brings more challenges to fine-grained
editing tasks, especially those that require interactive-style
*Corresponding author.
Copyright © 2026, Association for the Advancement of Artificial
Intelligence (). All rights reserved.
Figure 1: Examples of the key issues in current DBIE. (i)
Text prompts used in ClipDrag (Jiang, Wang, and Chen 2025)
remain insufficient for solving the ambiguity issue; (ii) prede-
fined mapping functions employed by FastDrag (Zhao et al.
2024) and RegionDrag (Lu, Li, and Han 2024) boost effi-
ciency but severely compromise editing quality. The numbers
given in the upper-left corner of images indicate the latency
for dragging the regions of handle points to target positions.
manipulation (Liu et al. 2024; Zhao et al. 2025; Zhou et al.
2025a).
Recent studies (Shi et al. 2024; Cui et al. 2024) have ex-
plored the use of diffusion models for Drag-Based Image
Editing (DBIE), which enables users to manipulate images
by directly dragging objects via a set of user-specified handle
and target points. Existing diffusion-based DBIE methods
predominantly employ a point-based alternating optimization
strategy (Liu et al. 2024; Cui et al. 2024; Ling et al. 2024;
Hou et al. 2024; Choi et al. 2024; Shi et al. 2024), where
Step-1: optimizing the features of handle points toward corre-
sponding target positions by performing point motion super-
vision; Step-2: updating handle point positions iteratively via
KNN-based point tracking.
However, the point-based alternating workflow inevitably
brings two issues to DBIE: (i) point-based drag suffers from
high ambiguity and struggles to align with users’ intentions,
thereby severely compromising the precision of the editing
process; (ii) tackling DBIE through an alternating procedure
of motion supervision and point tracking is not only cumber-
some but also fails to always yield high-quality results, as
accurately estimating the updated positions of handle points
in each drag iteration is both challenging and time-consuming
(Ling et al. 2024; Cui et al. 2024). Besides, given that point-
based motion supervision offers only limited structural cues
about visual scenes, it cannot effectively guide DBIE.
Recently, ClipDrag (Jiang, Wang, and Chen 2025) sought
to mitigate ambiguity by incorporating constraints from text
prompts. Nonetheless, as a form of high-level descriptions,
texts are often too vague to provide control signals required
by fine-grained image manipulation (Shi et al. 2024; Zhang,
Rao, and Agrawala 2023). For example, as shown in Figure 1,
even with the guidance of the prompt “rotate the cat’s head
around its left cheek as the central point”, ClipDrag still fails
to achieve the desired outcome. To boost DBIE’s efficiency,
FastDrag (Zhao et al. 2024) and RegionDrag (Lu, Li, and Han
2024) proposed using predefined mapping functions, rather
than the learnable alternating paradigm. Unfortunately, the warpage function and the copy-and-paste strategy
used in FastDrag and RegionDrag are not flexible enough
to handle all editing tasks and are prone to yielding unreal-
istic or unnatural results—such as the distorted cat’s head,
deformed handbell, and visible artifacts in the edited areas
shown in Figure 1—thus severely degrading image quality.
Point-based motion supervision and tracking are cum-
bersome and often difficult to align with users’ intentions,
whereas relying solely on the warpage function or the copy-
and-paste strategy is far from delivering high-quality results.
These observations naturally lead us to ask two questions:
→Q1. Is there a more effective solution to the ambiguity issue? →Q2. How can we enhance the efficiency of DBIE approaches based on alternating point motion supervision and tracking, while further improving their editing quality?
These two questions motivate us to revisit DBIE from a
new perspective—unifying it as a Latent Region Optimiza-
tion (LRO) problem, which aims to leverage region-level
geometric transformations to optimize latent embeddings
and realize drag manipulation. Therefore, by specifying the
regions and types of geometric transformations, we can ef-
fectively address Q1. Furthermore, we design a simple-yet-
effective editing framework, DragNeXt, to tackle Q2. For
efficiency, it unifies DBIE as LRO and thus eliminates the
necessity of conducting handle point tracking by upgrad-
ing point motion supervision to region-level optimization of
latent embeddings. For editing quality, we propose a Progres-
sive Backward Self-Intervention (PBSI) strategy that solves
LRO by fully leveraging region-level self-intervention from
intermediate drag states. By bypassing point tracking and
considering region-level guidance from intermediate states, it
can achieve a better trade-off between efficiency and quality.
Contribution Summary: (i) We propose to unify DBIE as an
LRO problem. Therefore, by specifying the regions and types
of geometric transformations, we can effectively resolve the
ambiguity issue. (ii) We propose a simple yet effective editing
framework, DragNeXt, which tackles DBIE via LRO and further enhances editing quality by performing PBSI. (iii) We introduce NextBench, a benchmark with explicit user-intention annotations for evaluating the alignment between user expectations and edited results. (iv) Extensive experiments on NextBench demonstrate that our DragNeXt achieves a better trade-off between editing efficiency and quality.
Related Work
DragDiffusion (Shi et al. 2024) is the first work using dif-
fusion models to achieve DBIE, which followed (Pan et al.
2023) and conducted motion supervision and point tracking
alternately. Based on (Shi et al. 2024), GoodDrag (Zhang
et al. 2024) further enhanced the fidelity of dragged areas
by rearranging the drag process across multiple denoising
timesteps. DragText (Choi et al. 2024) proposed refining text
embeddings to avoid drag halting. DragonDiffusion (Mou
et al. 2023) and DiffEditor (Mou et al. 2024) discarded the
tracking phase and directly applied point motion supervision
between initial handle points and target points. To estimate
handle point positions more accurately, StableDrag (Cui et al.
2024) proposed a discriminative point tracking strategy, and
FreeDrag (Ling et al. 2024) designed a line search back-
tracking mechanism. EasyDrag (Hou et al. 2024) advanced
(Shi et al. 2024) via introducing a stable motion supervision,
which is beneficial for improving the quality of final results.
FastDrag (Zhao et al. 2024) and RegionDrag (Lu, Li, and
Han 2024) improved the efficiency of DBIE by employing
fixed predefined mapping functions, where (Lu, Li, and Han
2024) is based on copy-and-paste and thus requires users to
specify both handle and target areas. ClipDrag (Jiang, Wang,
and Chen 2025) reduced the ambiguity of DBIE via using
text prompts. DragNoise (Liu et al. 2024) proposed editing
on UNet’s bottleneck features, which inherently contain more
semantic information and can stabilize dragging.
REMARK 1. Our method differs from DragDiffusion, Good-
Drag, DragText, StableDrag, FreeDrag, EasyDrag, and Drag-
Noise fundamentally, since it does not rely on alternating
between point motion supervision and tracking. Instead of
only considering initial relationships between handle and tar-
get points, as in DragonDiffusion and DiffEditor, we fully
leverage progressive region-level guidance from intermediate
drag states. ClipDrag and RegionDrag overlook the ambiguity issue arising from the type of geometric transformations, whereas our approach does not rely on texts to reduce ambiguity. Different from FastDrag and RegionDrag, we only use geometric mapping functions to provide interventional signals, and our learnable backward self-intervention strategy can fully leverage the prior knowledge inherently learned by diffusion models via back-propagated gradients, avoiding the unnatural deformation caused by a fixed transformation pattern.
Methodology
Preliminaries
Diffusion Models. Diffusion models (Ho, Jain, and Abbeel
2020; Rombach et al. 2022; Dhariwal and Nichol 2021)
are composed of a diffusion process and a reverse process.
During the diffusion process, an image $x$ is encoded into latent space as $z_0$ and undergoes a gradual addition of Gaussian noise, $q(z_t|z_0) = \mathcal{N}(\sqrt{\alpha_t}\, z_0,\ (1-\alpha_t)\mathbf{I})$, where $\alpha_t$ is a non-learnable parameter that decreases with the timestep $t$. The reverse process recovers $z_0$ from $z_T$ by training a denoiser $\varepsilon_\Theta(\cdot)$:
$$\mathcal{L}_\Theta = \mathbb{E}_{t\sim U(1,T),\,\varepsilon_t\sim\mathcal{N}(0,\mathbf{I})}\big[\,\|\varepsilon_t - \varepsilon_\Theta(z_t; t, c)\|^2\,\big] \quad (1)$$
where $\varepsilon_t$ denotes the ground-truth noise at timestep $t$, and $c$ represents an extra condition. Following prior works (Liu et al. 2024; Shi et al. 2024; Zhang et al. 2024), we employ DDIM (Song, Meng, and Ermon 2020) in our approach.
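To make the objective in Equation (1) concrete, the following is a minimal PyTorch-style sketch of one Monte-Carlo sample of the denoising loss. It is illustrative only: `eps_theta` (the denoiser), `alpha_bar` (the cumulative noise schedule $\alpha_t$), and the optional condition `c` are placeholders rather than the actual Stable Diffusion components used in our framework.

```python
import torch

def diffusion_loss(eps_theta, z0, alpha_bar, c=None):
    """One Monte-Carlo sample of Eq. (1): || eps_t - eps_theta(z_t; t, c) ||^2."""
    B = z0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)       # t ~ U(1, T)
    eps = torch.randn_like(z0)                             # ground-truth noise eps_t
    a_t = alpha_bar[t].view(B, 1, 1, 1)                    # alpha_t for each sample
    z_t = a_t.sqrt() * z0 + (1.0 - a_t).sqrt() * eps       # q(z_t | z_0)
    return ((eps - eps_theta(z_t, t, c)) ** 2).mean()      # noise-prediction loss
```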
Drag-Based Image Editing. Given $n$ pairs of handle and target points $O = \{h_i = (x_i^h, y_i^h),\ g_i = (x_i^g, y_i^g)\}_{i=1,\dots,n}$,
DBIE aims to edit an image x by dragging objects or regions
indicated by handle points to target ones. Usually, an extra
binary mask M is used to specify the uneditable region of x.
Motion Supervision and Point Tracking. Current DBIE
methods (Shi et al. 2024; Zhang et al. 2024; Choi et al. 2024;
Cui et al. 2024; Ling et al. 2024; Hou et al. 2024; Liu et al.
2024) mainly rely on performing motion supervision and
point tracking alternately, where the former aims to transfer
the features of handle points to target positions while the latter
updates handle points iteratively and prevents dragging halt.
We use $F_{h_i/g_i}(z_t)$ to denote the features extracted by $\varepsilon_\Theta(\cdot)$ at the location $h_i$ or $g_i$. Therefore, the objective function of the motion supervision can be described by Equation (2):
$$\mathcal{L}_m(z_t^k) = \sum_{i=1}^{n} \sum_{q\in\pi(h_i^k)} \big\|F_{q+d_i}(z_t^k) - \mathrm{SG}\big(F_q(z_t^k)\big)\big\|_1 + R_M \quad (2)$$
where $z_t^k$ and $h_i^k$ denote the latent code $z_t$ and the handle point $h_i$ updated for $k$ iterations, $d_i = (g_i - h_i^k)/\|g_i - h_i^k\|_2$ is the normalized vector from $h_i^k$ to $g_i$, $\pi(h_i^k)$ denotes the neighborhood of $h_i^k$, $\mathrm{SG}(\cdot)$ stops gradients from being back-propagated to variables, and $R_M$ is a constraint term that ensures the consistency of uneditable regions. After the motion supervision in each iteration $k$, point tracking is performed:
$$h_i^{k+1} = \arg\min_{q\in\pi(h_i^k)} \big\|F_q(z_t^{k+1}) - F_{h_i}(z_t)\big\|_1 \quad (3)$$
where $F_{h_i}(z_t)$ indicates the features of the initial handle point $h_i$ in the original latent code $z_t$.
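For reference, the alternating workflow of Equations (2) and (3) can be summarized by the PyTorch-style sketch below. It is a simplified illustration of the point-based baselines, not our released code: `feat_fn`, the neighborhood radius `r`, the step size `lr`, and `orig_handle_feats` (the cached features $F_{h_i}(z_t)$ of the original latent) are placeholder assumptions, and the uneditable-region term $R_M$ is omitted.

```python
import torch
import torch.nn.functional as F

def patch_coords(p, r, H, W):
    """Integer (x, y) coordinates inside the (2r+1)^2 neighborhood pi(p) of p = (x, y)."""
    x, y = p
    xs = torch.arange(max(x - r, 0), min(x + r + 1, W))
    ys = torch.arange(max(y - r, 0), min(y + r + 1, H))
    return torch.cartesian_prod(xs, ys)

def bilinear(feats, xy):
    """Bilinearly sample a (C, H, W) feature map at a float (x, y) location."""
    C, H, W = feats.shape
    scale = torch.tensor([W - 1.0, H - 1.0], device=feats.device)
    grid = (xy.to(feats.device).view(1, 1, 1, 2) / scale) * 2.0 - 1.0
    return F.grid_sample(feats[None], grid, align_corners=True).view(C)

def alternating_step(feat_fn, z_t, handles, targets, orig_handle_feats, r=3, lr=0.01):
    """One iteration of point motion supervision (Eq. 2) followed by point tracking (Eq. 3).
    feat_fn maps a latent to (C, H, W) features; handles/targets are lists of (x, y) ints."""
    z_t = z_t.detach().requires_grad_(True)
    feats = feat_fn(z_t)
    _, H, W = feats.shape
    loss = feats.new_zeros(())
    for (hx, hy), (gx, gy) in zip(handles, targets):
        d = torch.tensor([gx - hx, gy - hy], dtype=torch.float32)
        d = d / (d.norm() + 1e-8)                          # normalized drag direction d_i
        for q in patch_coords((hx, hy), r, H, W):
            src = feats[:, q[1], q[0]].detach()            # SG(F_q(z_t^k))
            dst = bilinear(feats, q.float() + d)           # F_{q + d_i}(z_t^k)
            loss = loss + (dst - src).abs().sum()
    loss.backward()
    z_next = (z_t - lr * z_t.grad).detach()                # motion-supervision update of z_t^k
    # Point tracking (Eq. 3): nearest-feature search against the original handle features.
    new_feats = feat_fn(z_next)
    new_handles = []
    for (hx, hy), f0 in zip(handles, orig_handle_feats):
        cands = patch_coords((hx, hy), r, H, W)
        dists = torch.stack([(new_feats[:, q[1], q[0]] - f0).abs().sum() for q in cands])
        best = cands[dists.argmin()]
        new_handles.append((int(best[0]), int(best[1])))
    return z_next, new_handles
```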
REMARK 2. →Why is point tracking critical for motion-based methods? Point-based motion supervision is too local to provide enough guidance for the whole editing procedure. Losing the positions of handle points severely interrupts the drag process, as no alternative guidance for editing remains, thereby significantly damaging the quality of edited images. →What are the limitations of methods based on motion supervision and point tracking? Firstly, although the
use of point tracking can alleviate the inherent limitation of
point-based motion supervision, it is still very challenging
to precisely estimate the updated positions of handle points.
Inaccurate coordinate estimation can significantly mislead
the dragging process, resulting in suboptimal outcomes. Sec-
ondly, the alternating execution of Equation (2) and (3) results
in low efficiency of DBIE, since motion supervision is re-
peatedly disrupted by iterative point tracking. Thirdly, point
motion supervision suffers from high ambiguity and easily
leads to gaps between user expectations and actual results.
Last but not least, while using dense points can reduce ambi-
guity in some situations, this will substantially decrease the
efficiency of the alternating workflow, as shown in Figure 9.
Latent Region Optimization for Reliable
Drag-Based Image Editing
We begin by outlining key factors behind the ambiguity issue
of DBIE in Proposition 1, which can be further summarized
as two key questions: how to drag? and what to drag?

Figure 2: Factor-1 and -2. (a) Factor-1: translation or deformation? (b) Factor-2: move the nose or head?

Figure 3: Rethink DBIE. (a) Handle Factor-1: clarify the drag type. (b) Handle Factor-2: specify the drag region.
Proposition 1 (Key Factors to Ambiguity). The ambiguity of DBIE is twofold: →Factor-1. drag operations inherently involve multiple types—such as translation, deformation, and rotation—and treating them as type-agnostic induces ambiguity about users' intentions (how to drag?); →Factor-2. point indicators are insufficient for accurately specifying objects or regions that need to be dragged (what to drag?).
In Figure 2, we provide an illustration for the two key
factors, Factor-1 and Factor-2. On one hand, the drag opera-
tion in Figure 2(a) is inherently ambiguous since it could be
interpreted as either a translational movement of the cup or a
deformation of its edge region. This ambiguity stems from
uncertainty about the types of drag operations (how to drag?),
which inevitably increases gaps between user expectations
and model behaviors, thus damaging the precision of the
editing process. On the other hand, in Figure 2 (b), the drag
instruction could be either dragging the raccoon’s nose, its
head, or even its whole body. This type of ambiguity arises
from uncertainty about which areas or objects to drag (what
to drag?) since points are too ambiguous to clearly reflect
users’ intentions. How to drag and what to drag are two fun-
damental problems in DBIE. Although textual description
appears to be a shortcut, it actually does not work well as
exemplified in Figure 1. We argue there is No Free Lunch in
resolving these ambiguity issues, which means it is necessary
to enable models to perceive drag operation types and areas
in a more explicit way and design a more effective approach
to guide them toward producing user-intended results.
REMARK 3. →Some previous methods have noticed the
ambiguity issue in DBIE, but few of them consider both
Factor-1 and Factor-2, or provide a systematic and clear
analysis for this problem, which we believe is critical and
valuable for inspiring the further development of DBIE.
Based on the above observations, we introduce step by step how to explore DBIE from a new perspective, i.e., unifying DBIE as a Latent Region Optimization (LRO)
problem. We first rethink the DBIE task in Proposition 2.
Proposition 2 (Rethink DBIE). DBIE can be regarded as
performing geometric transformations on user-specified re-
gions of images.
For instance, Result-1 given in Figure 2 (a) can be seen as applying a deformation transformation to the white coffee cup region. From this perspective, by specifying the regions and types of geometric transformations, we can effectively resolve the ambiguity issue caused by Factor-1 and Factor-2, because both how and what to drag are clarified, as illustrated in Figure 3.

Figure 4: Examples of estimating target regions. (a) Rotation: the handle region is rotated around a user-specified central point $c_i$ determined together with the handle point $h_i$. (b) Translation: the handle region $\vartheta_i$ is mapped to the target region $\rho_i$ by a translation matrix built from the handle point $h_i$ and the target point $g_i$.
Based on Proposition 2, we further give the definition
of our region-level Reliable Drag-based Image Editing
(Reliable DBIE) in Definition 1, which aims to help users obtain reliable editing results and narrow the gap between their
expectations and actual outcomes.
Definition 1 (Reliable DBIE). Reliable DBIE is to manip-
ulate user-specified regions E = {ϑi}i=1,...,n of an image
x based on the corresponding geometric transformations
Γ = {fi}i=1,...,n inferred from instructions given by users.
Currently, the editing process is primarily performed on noise
latent embeddings encoded by diffusion models, as they are
more editable than original images (Mokady et al. 2023; Ruiz
et al. 2023). Thus, we can further extend Definition 1 to Def-
inition 2, and unify DBIE as a Latent Region Optimization
(LRO) problem.
Definition 2 (Unify DBIE as LRO). DBIE can be unified as
optimizing specific target regions P = {ρi}i=1,...,n within
a latent code zt based on user-specified handle regions
E = {ϑi}i=1,...,n and the corresponding geometric transfor-
mations Γ = {fi}i=1,...,n involved in user instructions:
$$z_t^* = \arg\min_{z_t} \mathcal{L}_{LRO}\big(z_t,\ \{\rho_i\}_{i=1,\dots,n}\big), \qquad \{\rho_i\}_{i=1,\dots,n} = \delta(E, \Gamma) \quad (4)$$
where $\mathcal{L}_{LRO}$ is the objective function of LRO, and $\delta(\cdot)$ produces binary masks $\{\rho_i\}_{i=1,\dots,n}$ that identify the target regions required to be optimized in $z_t$ according to $E$ and $\Gamma$.
REMARK 4. →Why LRO? LRO serves as a bridge between
DBIE and region-level geometric transformations. Therefore,
we can leverage many well-studied geometric transformation
functions in computer graphics to realize more reliable DBIE
via explicitly controlling the dragging process. →What can
LRO do? Different from methods based on alternating point
motion supervision and tracking, LRO takes into account
region-level visual information, which provides more robust
guidance for latent code manipulation. Under such regional
supervision, it is unnecessary to excessively focus on posi-
tions of some specific points, as there exists sufficient context
information to guide dragging. →How to estimate target
latent regions? Target latent regions are estimated using geo-
metric transformation functions widely adopted in computer
graphics. As exemplified in Figure 4, if users intend to rotate
the handle region ϑi, the region can be multiplied by a rota-
tion matrix to achieve the desired geometric transformation.
We observe that rotation and translation geometric trans-
formations can cover most DBIE scenarios. We reformu-
late DBIE’s user input: users specify a set of handle regions
E = {ϑi}i=1,...,n for an input image and give the correspond-
ing drag instructions C = {Ti,Oi}i=1,...,n. If the operation
type Ti =“rotation”, Oi = {hi, gi, ci} where hi and gi
denote a pair of a handle point and a target point, and ci
represents a rotation center of ϑi; otherwise, Oi = {hi, gi}.
Also, a binary mask M is adopted to specify the uneditable
region. Based on {hi, gi} or {hi, gi, ci}, the transformation
function fi can be constructed by determining the correspond-
ing rotation and translation matrix. For details on converting
input points to fi, please refer to the supplementary material.
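As a concrete illustration, the following is a minimal Python/NumPy sketch of how $\delta(\cdot)$ could turn a binary handle-region mask and the user points into a target-region mask together with the coordinate mapping used later. The function name `delta`, its `frac` argument (useful for intermediate states), and the mask-based representation are our own illustrative assumptions; the full matrix formulation of the transforms is given in the appendix (Equations (16)-(19)).

```python
import numpy as np

def delta(handle_mask, h, g, c=None, frac=1.0):
    """Map a binary handle-region mask to the target-region mask implied by the user
    points: translation if c is None, otherwise rotation around c. `frac` scales the
    angle/offset (frac < 1 yields the intermediate states used by PBSI)."""
    H, W = handle_mask.shape
    ys, xs = np.nonzero(handle_mask)
    pts = np.stack([xs, ys], axis=1).astype(float)           # (N, 2) region coordinates (x, y)
    h, g = np.asarray(h, float), np.asarray(g, float)
    if c is None:                                            # translation: offset g - h
        new_pts = pts + frac * (g - h)
    else:                                                    # rotation around c by angle(h, c, g)
        c = np.asarray(c, float)
        u, v = h - c, g - c
        theta = frac * np.arctan2(u[0] * v[1] - u[1] * v[0], u @ v)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        new_pts = (pts - c) @ R.T + c
    new_pts = np.rint(new_pts).astype(int)
    new_pts[:, 0] = np.clip(new_pts[:, 0], 0, W - 1)
    new_pts[:, 1] = np.clip(new_pts[:, 1], 0, H - 1)
    target_mask = np.zeros_like(handle_mask)
    target_mask[new_pts[:, 1], new_pts[:, 0]] = 1            # rho_i: target-region mask
    mapping = list(zip(map(tuple, pts.astype(int)), map(tuple, new_pts)))  # handle -> target coords
    return target_mask, mapping
```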
REMARK 5. DBIE via regional geometric transforma-
tions. Object movement can be achieved by translating an
object's entire region; deformation can be realized by trans-
lating only its edge region; 2D rotation can be achieved by
applying a rotation transformation; and 3D rotation can be
interpreted as translating the sub-region of an object, assisted
by prior knowledge inherently learned in diffusion models (as
shown in Figure 7 (j), the car’s 3D rotation can be realized by
translating its front to the right). Explicit geometric functions
and region-level guidance can help achieve better DBIE.
Progressive Backward Self-Intervention: Less Meets
More!
Based on Definition 2, we further design DragNeXt to en-
hance both editing quality and efficiency. As mentioned be-
fore, the alternating workflow lowers the efficiency of DBIE,
while inaccurate handle point tracking easily leads to drag-
ging halt and makes results unsatisfactory. Therefore, Drag-
NeXt addresses DBIE from an LRO perspective, eliminating
the need for KNN-based point tracking by explicitly advanc-
ing point-based motion supervision to region-level optimiza-
tion of latent embeddings. Moreover, it employs a Progressive
Backward Self-Intervention (PBSI) strategy, which does not
require accurately tracking point positions but still achieves
superior editing results by fully leveraging progressive region-
level guidance from intermediate transformation states.
Progressive Backward Self-Intervention. Figure 5 gives
a brief illustration of our approach. Given an input image
x, we first encode it into latent space and perform DDIM
inversion to produce zT . Then, PBSI is conducted from T to
T ′ during denoising with K iterations per timestep. We take
the handle region ϑi at the k-th iteration of the timestep t as
an example to illustrate PBSI. We first extract the features of $z_t^k$ by concatenating the outputs of the last upsample blocks of all stages of $\varepsilon_\Theta(z_t^k)$ and upsampling them to half of the resolution of $x$, denoted as $\mathcal{F}(z_t^k)$. Then, we estimate the intermediate transformation state $\rho_i^{t,k}$ of the handle region $\vartheta_i$ within the extracted features $\mathcal{F}(z_t^k)$ based on the user-given conditions $C$, which can be described by Equation (5):
$$\rho_i^{t,k},\ \Pi_{\vartheta_i\to\rho_i^{t,k}} = \delta(\vartheta_i, f_i^{t,k}), \qquad \delta(\vartheta_i, f_i^{t,k}) = \begin{cases} \mathrm{Rot}(\vartheta_i, c_i, \theta), & \text{if } T_i = \text{``rotation''} \\ \mathrm{Trans}(\vartheta_i, \omega), & \text{otherwise.} \end{cases} \quad (5)$$
Figure 5: A brief illustration of our DragNeXt. The input image is encoded by the VAE encoder and inverted to $z_T$; for $T' < t < T$, PBSI is executed with $K$ iterations per timestep (feature extraction $\mathcal{F}(\cdot)$, intermediate LRO, and backward self-intervention $-\partial\mathcal{L}_{LRO}/\partial z_t^k$); otherwise, plain denoising is performed, and the final latent is decoded by the VAE decoder. For brevity, the drag instruction (rotation, translation, or deformation) is visualized on $x$.
In the equation, $\mathrm{Rot}(\vartheta_i, c_i, \theta)$ rotates the handle region $\vartheta_i$ around the center point $c_i$ by an angle $\theta = \eta_{t,k}\cdot\angle h_i c_i g_i$, $\mathrm{Trans}(\vartheta_i, \omega)$ translates $\vartheta_i$ according to the offset vector $\omega = \eta_{t,k}\cdot(g_i - h_i)$, and $\eta_{t,k} = \frac{K(T-t)+k}{K(T-T'+1)}$ is a weighting factor that determines the angles or offsets of intermediate states. Also, $\rho_i^{t,k}$ is a binary mask that identifies the target intermediate region in $\mathcal{F}(z_t^k)$, and $\Pi_{\vartheta_i\to\rho_i^{t,k}}$ represents the coordinate mapping from the handle region $\vartheta_i$ to the intermediate state $\rho_i^{t,k}$. Finally, we copy and detach the features extracted from the original latent code, $\mathcal{F}'(z_t) = \mathrm{detach}(\mathcal{F}(z_t))$. Moreover, we interventionally adjust the detached features according to the obtained coordinate mapping, $\mathcal{F}'(z_t)[\Pi_{\vartheta_i\to\rho_i^{t,k}}]$, thereby perturbing the original latent representations and transforming the features of the handle region $\vartheta_i$ to the intermediate target position $\rho_i^{t,k}$. We consider self-intervention from the perturbed features to $\mathcal{F}(z_t^k)$ and back-propagate the interventional signal to the latent code $z_t^k$ along the denoiser to update the latent features. This can be depicted by Equations (6) and (7):
$$z_t^{k+1} \longleftarrow z_t^k - \frac{\partial \mathcal{L}_{LRO}}{\partial z_t^k}, \quad (6)$$
$$\mathcal{L}_{LRO} = \Big\|\mathcal{F}(z_t^k)\ast\rho_i^{t,k} - \mathcal{F}'(z_t)\big[\Pi_{\vartheta_i\to\rho_i^{t,k}}\big]\ast\rho_i^{t,k}\Big\|_1 + R_M. \quad (7)$$
Minimizing $\mathcal{L}_{LRO}$ back-propagates self-intervention gradients to the latent code, thus progressively dragging handle regions to target positions. Once PBSI is complete, we denoise $z_{T'}$ to $z_0$ and decode it into image space. The pseudocode of our DragNeXt is provided in Algorithm 1 of the appendix.
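The PBSI update of Equations (5)-(7) can also be summarized by the PyTorch-style sketch below. It is a schematic rendering under our own naming assumptions: `extract_feats`, `warp`, and `drag` are placeholders for the UNet feature extractor $\mathcal{F}(\cdot)$, the geometric mapping $\delta(\cdot)$, and the user instruction, and the fidelity term $R_M$ for uneditable regions is omitted for brevity.

```python
import torch

def eta(t, k, T, T_prime, K):
    """Progressive weight eta_{t,k} = (K * (T - t) + k) / (K * (T - T' + 1))."""
    return (K * (T - t) + k) / (K * (T - T_prime + 1))

def pbsi_iteration(extract_feats, warp, z_t_k, z_t_orig, handle_mask, drag,
                   t, k, T, T_prime, K, lr=2e-2):
    """One PBSI update (Eqs. 5-7). `warp(x, drag, frac)` applies the geometric map,
    scaled by `frac`, to a mask or a feature map; all callables are placeholders."""
    frac = eta(t, k, T, T_prime, K)                    # how much of the full drag to apply now
    z = z_t_k.detach().requires_grad_(True)
    feats = extract_feats(z)                           # F(z_t^k), shape (C, H, W)
    target_mask = warp(handle_mask, drag, frac)        # rho_i^{t,k} (Eq. 5)
    feats_orig = extract_feats(z_t_orig).detach()      # F'(z_t): detached original features
    feats_moved = warp(feats_orig, drag, frac)         # F'(z_t)[Pi_{theta_i -> rho_i^{t,k}}]
    # Region-level self-intervention loss of Eq. (7), restricted to the intermediate region.
    loss = ((feats - feats_moved).abs() * target_mask).sum()
    loss.backward()
    return (z - lr * z.grad).detach()                  # latent update of Eq. (6)
```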
REMARK 6. →Why backward self-intervention? Our
method also adopts geometric mapping functions. However,
unlike RegionDrag and FastDrag, which directly use them
to manipulate latent code, we instead leverage them to pro-
vide interventional signal. By optimizing latent code through
back-propagated gradients from the denoiser, our approach
fully exploits the prior of pretrained diffusion models, thereby
mitigating unnatural results caused by fixed mapping func-
tions. →Difference between Equations (7) and (2). $\mathcal{L}_{LRO}$ considers region-level guidance, whereas $\mathcal{L}_m$ performs point supervision and needs to iteratively track handle point positions. →Discussion on Equation (5). We unify translation, deformation, and 3D rotation into a single mapping function, $\mathrm{Trans}(\cdot)$, as deformation and 3D rotation can be interpreted as translating partial regions of an object, assisted by priors inherently encoded in diffusion models; e.g., in Figure 7 (j), the car's 3D rotation can be achieved by translating its front to the right. We extend drag points to regional guidance to more clearly specify the regions to move, deform, or rotate.

Figure 6: Comparison between our NextBench and the previous benchmarks, DragBench and Drag100. In DragBench and Drag100 the user expectation is unspecified (rotation or translation? ambiguous), whereas each NextBench sample provides an unambiguous user expectation (e.g., "move the cat's face down") together with a handle region and an editable region.
Experiments
We first introduce our NextBench and evaluation metrics,
followed by the main results of our method and ablation stud-
ies. We provide implementation details in the supplementary
material due to the limited space of the paper’s main body.
NextBench: a Benchmark for Reliable DBIE
To better evaluate model performance on Reliable DBIE,
we propose a new benchmark, NextBench, that comprises
234 test samples with drag operations including translation,
2D/3D rotation, and deformation. Each sample is clearly an-
notated with user intentions to better assess how well model
outputs align with user expectations, as shown in Figure 6.
Why NextBench? Existing benchmarks, such as DragBench
(Shi et al. 2024) and Drag100 (Zhang et al. 2024), still con-
tain ambiguous drag instructions, e.g., as shown in Figure 6,
the raccoon’s and dog’s heads could either be translated or
rotated, and both would be considered as satisfactory results.
Figure 7: Qualitative results achieved by our DragNeXt, compared with DragDiff, GoodDrag, FreeDrag, DragNoise, DiffEditor, ClipDrag, RegionDrag, and FastDrag. User intentions: (a) please move the potted plant to the left; (b) lower the height of the desk lamp; (c) rotate the cat's head around its left cheek as the central point; (d) rotate the raccoon's head around its chin as the central point; (e) rotate the sculpture to the left around its nose; (f) drag the root of the tree and widen the road; (g) extend the bird's golden feathers; (h) drag the cup to make it taller; (i) drag the face to make it face straight ahead; (j) drag the car's front to rotate it toward the right side. The numbers indicate per-example latency.
Our NextBench explicitly annotates each sample with a clear
user expectation, handle regions, and editable areas, therefore
enabling a more reliable assessment of intention–result align-
ment and regional consistency. NextBench also treats 3D and
2D rotations as two distinct operations, showing that current
approaches struggle with 2D rotation and excel at yielding
3D rotation based on the prior of pretrained diffusion models.
Evaluation Metrics
Following prior work (Zhang et al. 2024; Shi et al. 2024), we
use LPIPS and DAI to evaluate performance on NextBench,
where the radius of DAI is set to 20. To better assess region-level DBIE, LPIPS is computed between original images and edited results over three parts: (i) LPIPS_ue for uneditable regions (a lower LPIPS_ue indicates better preservation of uneditable regions); (ii) LPIPS_th for the consistency between handle and target regions (a lower LPIPS_th means handle regions are successfully dragged to target positions); (iii) LPIPS_hh for handle regions (successfully dragging handle regions to target positions should result in a higher LPIPS_hh between the handle regions of original input images and edited results, reflecting the change of visual content).
Main Results
We compare our method with eight typical open-source DBIE
methods: DragDiffusion, DiffEditor, DragNoise, FreeDrag,
Method Lat↓ DAI↓ LPIPS_ue↓ LPIPS_th↓ LPIPS_hh↑
DragDiff 36s
DiffEditor 24s
DragNoise 34s
FreeDrag 69s
GoodDrag 51s
ClipDrag 58s
RegionDrag 3s
FastDrag 8s
DragNeXt 28s
Table 1: Quantitative results on NextBench. “Lat” indicates
mean latency per image for dragging handle point regions to
target positions. ↑ / ↓ denotes higher/lower values are better.
GoodDrag, ClipDrag, RegionDrag, and FastDrag.
Qualitative Results. We present qualitative results of our
method in Figure 7, from which we can make the follow-
ing observations. By specifying the regions and the types of geometric transformations, DragNeXt achieves better alignment between user expectations and results, e.g., as shown in Figure 7 (a), our model translates the potted plant leftward without damaging its shape. Based on LRO and PBSI, DragNeXt achieves a better trade-off between efficiency and quality, e.g., it is noticeably more efficient than methods based on alternating point motion supervision and tracking, while delivering noticeably higher editing quality than those relying solely on predefined mapping functions.
Figure 8: Voting results of the user study; 84% of the votes favored DragNeXt, with the remainder split among ClipDrag, RegionDrag, FastDrag, and "none of the results is satisfying".
Figure 9: Efficiency and quality improvements over the point-based alternating workflow (GoodDrag and ClipDrag with 2, 6, and 12 points versus our DragNeXt). User intentions: (a) move the hand holding a metallic-copper handbell; (b) rotate the cat's head around its left cheek as the central point; (c) lower the height of the desk lamp.
REMARK 7. Discussion on 2D/3D rotation. →We observe
that existing DBIE methods particularly excel at yielding 3D
rotation effects, e.g., as shown in Figure 7 (i) and (j), drag-
ging the face or the vehicle front enables most methods to
rotate them. This capability arises from the strong prior of
pretrained diffusion models, which are creative to generate
rotated objects based on latent features perturbed by drag op-
erations. Among these methods, DragNeXt leverages region-
level visual cues, which can obviously guide diffusion models
to realize better 3D rotation effects. 7→Current DBIE meth-
ods generally fail to perform 2D rotation, as these patterns
are not well captured and learned by diffusion models. Drag-
NeXt can mitigate this by explicitly using regional geometric
transformations.
Quantitative Results. The quantitative results are summa-
rized in Table 1. The table shows that our method achieves the lowest DAI and LPIPS_th and the highest LPIPS_hh, demonstrating that our method can effectively drag objects from handle regions to target positions. Failing to drag objects to target positions would instead result in a low LPIPS_hh—indicating little change in handle regions—and a high LPIPS_th due to the mismatch between original handle regions and edited target regions. Also, the value of LPIPS_ue indicates that our method preserves the high fidelity of uneditable areas.
Anonymous User Study. We further conducted user studies
to validate our method, in which 26 participants were invited. Selecting the most appropriate result from too many options is time-consuming; to reduce the volunteers' workload, we include only the three most relevant methods: ClipDrag, RegionDrag, and FastDrag. The results consistently demonstrate the superiority of DragNeXt, e.g., 84% of the votes favored our results, indicating higher quality and better alignment with user
expectations (see the supplementary material for details).
Method Analysis
Efficiency and Quality Improvements. In Figure 9, we ana-
lyze the efficiency and quality improvements of our method
over the point-based alternating workflow. Without losing
generality, we choose two recent typical works—GoodDrag
and ClipDrag—as compared baselines. The results consis-
tently validate the effectiveness of our method. Point-based methods often struggle to align with user intentions, e.g., as shown in Figure 9 (a), GoodDrag and ClipDrag drag only the hand and leave the handbell unmoved when using 2 points. Although increasing the number of points can mitigate ambiguity, it largely slows down the alternating drag process and does not always guide models to yield satisfactory results.

Figure 10: Ablation study on PBSI. User intentions: (a) move the hand holding a metallic-copper handbell; (b) rotate the cat's head around its left cheek as the central point; (c) lower the height of the desk lamp. "Full PBSI" indicates using the full PBSI strategy, "w/o inter" means the guidance from intermediate drag states is not considered in PBSI, and "PBSI: N" indicates that PBSI is performed over N timesteps. Zoom in for a better view.
Ablation Study on PBSI. We provide ablation studies for
our PBSI strategy in Figure 10. Based on the results shown in the figure, we have the following observations. Firstly, removing the guidance of intermediate states significantly degrades output quality, e.g., the hand in Figure 10 (a) is dragged to an incorrect position, and unnatural results are produced in Figure 10 (b), demonstrating its important role in achieving high-quality DBIE results. We also study the impact of performing PBSI over different numbers of timesteps. When PBSI is applied to only a single denoising timestep, objects cannot be successfully dragged to target positions. By contrast, increasing the number of timesteps over which PBSI is performed clearly improves the quality of edited results, saturating after 5 timesteps, which also indicates the effectiveness of our method in guiding diffusion models to achieve DBIE.
Conclusion
We propose to address Drag-Based Image Editing (DBIE)
from a new perspective—unifying it as a Latent Region Op-
timization (LRO) problem that aims to use region-level ge-
ometric transformations to optimize latent code to realize
drag-based manipulation. By specifying the areas and types
of geometric transformations, we can effectively reduce gaps
between users’ intentions and actual model behaviors. We
also design a new simple-yet-effective editing framework,
dubbed DragNeXt. It solves LRO through a Progressive
Backward Self-Intervention (PBSI), which simplifies the pro-
cedure of DBIE while further enhancing editing quality by
fully leveraging region-level structure information and pro-
gressive guidance from intermediate transformation states.
Physically driven DBIE remains highly challenging. There-
fore, in the future, we plan to enhance our DragNeXt by in-
tegrating physics-based geometric transformation functions.
Acknowledgments
This research is supported by the RIE2025 Industry Align-
ment Fund – Industry Collaboration Projects (IAF-ICP)
(Award I2301E0026), administered by A*STAR, as well as
supported by Alibaba Group and NTU Singapore through
Alibaba-NTU Global e-Sustainability CorpLab (ANGEL).
References
Brooks, T.; Holynski, A.; and Efros, A. A. 2023. Instruct-
pix2pix: Learning to follow image editing instructions. In
Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 18392–18402.
Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; and Zheng, Y.
2023. Masactrl: Tuning-free mutual self-attention control for
consistent image synthesis and editing. In Proceedings of
the IEEE/CVF international conference on computer vision,
22560–22570.
Choi, G.; Jeong, T.; Hong, S.; and Hwang, S. J. 2024. Drag-
Text: Rethinking Text Embedding in Point-based Image Edit-
ing. arXiv preprint arXiv:.
Chung, J.; Hyun, S.; and Heo, J.-P. 2024. Style injection
in diffusion: A training-free approach for adapting large-
scale diffusion models for style transfer. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, 8795–8805.
Cui, Y.; Zhao, X.; Zhang, G.; Cao, S.; Ma, K.; and Wang,
L. 2024. StableDrag: Stable dragging for point-based image
editing. In European Conference on Computer Vision, 340–
356. Springer.
Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat
gans on image synthesis. Advances in neural information
processing systems, 34: 8780–8794.
Giannone, G.; Nielsen, D.; and Winther, O. 2022. Few-Shot
Diffusion Models. In NeurIPS 2022 Workshop on Score-
Based Methods.
Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.;
Pritch, Y.; and Cohen-Or, D. 2022. Prompt-to-prompt im-
age editing with cross attention control. arXiv preprint
arXiv:.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion
probabilistic models. Advances in neural information pro-
cessing systems, 33: 6840–6851.
Hou, X.; Liu, B.; Zhang, Y.; Liu, J.; Liu, Y.; and You, H. 2024.
Easydrag: Efficient point-based manipulation on diffusion
models. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 8404–8413.
Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.;
Wang, L.; Chen, W.; et al. 2022. Lora: Low-rank adaptation
of large language models. ICLR, 1(2): 3.
Jiang, Z.; Wang, Z.; and Chen, L. 2025. CLIPDrag: Com-
bining Text-based and Drag-based Instructions for Image
Editing. In Proceedings of the Thirteenth International Con-
ference on Learning Representations.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochas-
tic optimization. arXiv preprint arXiv:.
Liao, B.; Wang, X.; Zhu, L.; Zhang, Q.; and Huang, C. 2025.
Vig: Linear-complexity visual sequence learning with gated
linear attention. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 39, 5182–5190.
Ling, P.; Chen, L.; Zhang, P.; Chen, H.; Jin, Y.; and Zheng,
J. 2024. Freedrag: Feature dragging for reliable point-based
image editing. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 6860–6870.
Liu, H.; Xu, C.; Yang, Y.; Zeng, L.; and He, S. 2024. Drag
your noise: Interactive point-based editing via diffusion se-
mantic propagation. In Proceedings of the IEEE/CVF confer-
ence on computer vision and pattern recognition, 6743–6752.
Lu, J.; Li, X.; and Han, K. 2024. Regiondrag: Fast region-
based image editing with diffusion models. In European
Conference on Computer Vision, 231–246. Springer.
Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; and Cohen-
Or, D. 2023. Null-text inversion for editing real images using
guided diffusion models. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, 6038–
6047.
Mou, C.; Wang, X.; Song, J.; Shan, Y.; and Zhang, J. 2023.
Dragondiffusion: Enabling drag-style manipulation on diffu-
sion models. arXiv preprint arXiv:.
Mou, C.; Wang, X.; Song, J.; Shan, Y.; and Zhang, J. 2024.
Diffeditor: Boosting accuracy and flexibility on diffusion-
based image editing. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, 8488–
8497.
Pan, X.; Tewari, A.; Leimkühler, T.; Liu, L.; Meka, A.; and
Theobalt, C. 2023. Drag your gan: Interactive point-based
manipulation on the generative image manifold. In ACM
SIGGRAPH 2023 conference proceedings, 1–11.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Om-
mer, B. 2022. High-resolution image synthesis with latent dif-
fusion models. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 10684–10695.
Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.;
and Aberman, K. 2023. Dreambooth: Fine tuning text-to-
image diffusion models for subject-driven generation. In
Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 22500–22510.
Shi, Y.; Xue, C.; Liew, J. H.; Pan, J.; Yan, H.; Zhang, W.;
Tan, V. Y.; and Bai, S. 2024. Dragdiffusion: Harnessing
diffusion models for interactive point-based image editing.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 8839–8849.
Song, J.; Meng, C.; and Ermon, S. 2020. Denoising diffusion
implicit models. arXiv preprint arXiv:.
Sun, H.; Li, W.; Liu, J.; Chen, H.; Pei, R.; Zou, X.; Yan, Y.;
and Yang, Y. 2024. Coser: Bridging image and language for
cognitive super-resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
25868–25878.
Tan, W.; Chen, S.; and Yan, B. 2023. Diffss: Diffusion
model for few-shot semantic segmentation. arXiv preprint
arXiv:.
Wu, R.; Yang, T.; Sun, L.; Zhang, Z.; Li, S.; and Zhang,
L. 2024. Seesr: Towards semantics-aware real-world image
super-resolution. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 25456–25467.
Zhang, L.; Rao, A.; and Agrawala, M. 2023. Adding condi-
tional control to text-to-image diffusion models. In Proceed-
ings of the IEEE/CVF international conference on computer
vision, 3836–3847.
Zhang, Y.; Huang, N.; Tang, F.; Huang, H.; Ma, C.; Dong, W.;
and Xu, C. 2023. Inversion-based style transfer with diffusion
models. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 10146–10156.
Zhang, Z.; Liu, H.; Chen, J.; and Xu, X. 2024. GoodDrag: To-
wards good practices for drag editing with diffusion models.
arXiv preprint arXiv:.
Zhao, K.; Shi, J.; Zhu, B.; Zhou, J.; Shen, X.; Zhou,
Y.; Sun, Q.; and Zhang, H. 2025. Real-Time Motion-
Controllable Autoregressive Video Diffusion. arXiv preprint
arXiv:.
Zhao, X.; Guan, J.; Fan, C.; Xu, D.; Lin, Y.; Pan, H.; and
Feng, P. 2024. FastDrag: Manipulate anything in one step.
arXiv preprint arXiv:.
Zhou, J.; Zhou, Y.; Zhao, K.; Xu, Q.; Zhu, B.; Hong, R.;
and Zhang, H. 2025a. Streaming Drag-Oriented Interac-
tive Video Manipulation: Drag Anything, Anytime! arXiv
preprint arXiv:.
Zhou, Y.; Chen, X.; Guo, Y.; Yu, J.; Hong, R.; and Tian, Q.
2024. Advancing incremental few-shot semantic segmenta-
tion via semantic-guided relation alignment and adaptation.
In International Conference on Multimedia Modeling, 244–
257. Springer.
Zhou, Y.; Xu, Q.; Cui, J.; Zhou, J.; Zhang, J.; Hong, R.; and
Zhang, H. 2025b. CARE Transformer: Mobile-Friendly Lin-
ear Visual Transformer via Decoupled Dual Interaction. In
Proceedings of the Computer Vision and Pattern Recognition
Conference, 20135–20145.
Zhu, L.; Huang, Z.; Liao, B.; Liew, J. H.; Yan, H.; Feng, J.;
and Wang, X. 2025. Dig: Scalable and efficient diffusion
models with gated linear attention. In Proceedings of the
Computer Vision and Pattern Recognition Conference, 7664–
7674.
A. DDIM Sampling and Inversion
In this section, we provide more details about DDIM (Song,
Meng, and Ermon 2020), which is employed in our editing
framework. DDIM defines the sampling of diffusion models
as a non-Markovian process:
$$q(z_{t-1}|z_t, z_0) = \mathcal{N}\Big(\sqrt{\alpha_{t-1}}\, z_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot \frac{z_t - \sqrt{\alpha_t}\, z_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 \mathbf{I}\Big). \quad (8)$$
Therefore, the sampling step can be formulated using Equation (9):
$$z_{t-1} = \sqrt{\alpha_{t-1}}\Big(\frac{z_t - \sqrt{1-\alpha_t}\,\varepsilon_\Theta(z_t)}{\sqrt{\alpha_t}}\Big) + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\varepsilon_\Theta(z_t) + \sigma_t\varepsilon, \quad (9)$$
where $\varepsilon\sim\mathcal{N}(0,\mathbf{I})$ represents standard Gaussian noise independent of the latent code $z_t$, and $\sigma_t = \eta\sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\,\sqrt{1-\alpha_t/\alpha_{t-1}}$ for all timesteps. When setting $\eta = 1$, Equation (9) becomes DDPM, which is equivalent to a stochastic differential equation (SDE). Setting $\eta = 0$ yields a deterministic sampling process, corresponding to an ordinary differential equation (ODE). Given the sampling process in Equation (9), DDIM inversion can be described by Equation (10):
$$z_{t+1} = \frac{\sqrt{\alpha_{t+1}}}{\sqrt{\alpha_t}}\Big(z_t - \sqrt{1-\alpha_t}\cdot\varepsilon_\Theta(z_t)\Big) + \sqrt{1-\alpha_{t+1}}\cdot\varepsilon_\Theta(z_t), \quad (10)$$
which is based on the assumption that the ODE is invertible in the limit of small step sizes.
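For reference, the deterministic sampling step (Equation (9) with $\eta = 0$) and the inversion step (Equation (10)) can be written as the short PyTorch-style sketch below; `eps_theta` and the cumulative schedule `alpha_bar` are placeholders for the pretrained denoiser and its $\alpha_t$ values.

```python
import torch

def ddim_denoise_step(eps_theta, z_t, t, t_prev, alpha_bar):
    """Deterministic DDIM sampling step (Equation (9) with eta = 0)."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_theta(z_t, t)
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # predicted clean latent
    return a_prev.sqrt() * z0_pred + (1 - a_prev).sqrt() * eps   # z_{t-1}

def ddim_inversion_step(eps_theta, z_t, t, t_next, alpha_bar):
    """DDIM inversion step (Equation (10)), valid in the small-step limit."""
    a_t, a_next = alpha_bar[t], alpha_bar[t_next]
    eps = eps_theta(z_t, t)
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps   # z_{t+1}
```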
B. Implementation Details
We implement our DragNeXt in PyTorch and, following
prior works (Zhang et al. 2024; Lu, Li, and Han 2024;
Zhao et al. 2024; Jiang, Wang, and Chen 2025), employ
Stable-Diffusion-v1-5 as the base model to ensure
fair comparison between methods. We optimize the learnable parameters using the Adam optimizer (Kingma and Ba 2014) with a learning rate of $2\times 10^{-2}$. Following (Shi et al. 2024; Zhang et al. 2024), we finetune diffusion models via LoRA (Hu et al. 2022) with a rank of 16. The number of denoising timesteps is set to $T_{\max} = 50$, and the inversion strength is fixed such that each input image is inverted to the timestep $T = 38$. Also, $T'$ and $K$ are set to 33 and 10, respectively. Following (Shi et al. 2024; Zhang et al. 2024; Lu, Li, and Han 2024), we incorporate mutual self-attention (Cao et al. 2023) starting from the 10-th layer of the UNet.
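Putting these hyperparameters together, the overall editing schedule can be outlined as follows. This is only an illustrative sketch under our own naming assumptions: `invert`, `denoise_step`, and `pbsi_update` are placeholders for DDIM inversion, a single DDIM denoising step, and one PBSI latent update, and the actual implementation additionally handles LoRA finetuning, feature extraction, and mutual self-attention.

```python
def drag_next_schedule(invert, denoise_step, pbsi_update, z0, T=38, T_prime=33, K=10):
    """High-level sketch: DDIM-invert the input latent to timestep T, run K PBSI
    iterations per timestep while T' < t <= T, then denoise the rest of the way."""
    z_t = invert(z0, T)                      # z_T obtained by DDIM inversion
    for t in range(T, 0, -1):
        if t > T_prime:                      # PBSI window
            for k in range(K):
                z_t = pbsi_update(z_t, t=t, k=k)
        z_t = denoise_step(z_t, t)           # one DDIM step: z_t -> z_{t-1}
    return z_t                               # edited z_0, decoded by the VAE afterwards
```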
C. NextBench: a Benchmark for Reliable DBIE
To better assess model performance on Reliable DBIE, we
propose a new benchmark, NextBench, comprising 234 care-
fully annotated samples with detailed drag instructions and
corresponding user intentions. As illustrated in Figure 11 (a),
each drag instruction is specified by six key components:
handle regions, editable regions, center points, handle points,
target points, and transformation types. Annotators explic-
itly record their intentions for each sample, enabling a more
faithful evaluation of how well generated results align with
user expectations.
NextBench is the first benchmark to explicitly incorporate
constraints on both the type and region of geometric trans-
formations for dragging, serving as a critical step toward
realizing Reliable DBIE. To streamline data collection, we
developed a user-friendly web-based system, following the
pipeline illustrated in Figure 11 (b), which will be publicly
released soon. NextBench offers diverse content, including
200 real images and 34 AI-generated images, covering 103
animal images, 18 artistic paintings, 32 landscapes, 24 plant
images, 31 human portraits, and 26 everyday objects such as
furniture and vehicles. As a high-quality benchmark is essen-
tial for driving progress in this field, we are committed to the
continuous maintenance and improvement of NextBench.
Evaluation metrics. Following prior work (Zhang et al.
2024; Shi et al. 2024), we use LPIPS and DAI to evaluate
performance on NextBench. DAI aims to measure distance
between patches centered at handle points and those centered
at corresponding target points, which is first introduced by
GoodDrag (Zhang et al. 2024):
$$\mathrm{DAI} = \frac{1}{n}\sum_{i=1}^{n} \frac{\big\|\phi(z_0)_{\pi(h_i;\gamma)} - \phi(\hat{z}_0)_{\pi(g_i;\gamma)}\big\|_2^2}{(1+2\gamma)^2}, \quad (12)$$
where $\phi$ denotes the VAE decoder that maps $z_0$ to the RGB image space, and $\pi(h_i;\gamma)$ represents a patch centered at $h_i$ with a radius $\gamma$. Equation (12) computes the mean squared error between the patch at the handle point $h_i$ of $\phi(z_0)$ and the corresponding patch at the target point $g_i$ of $\phi(\hat{z}_0)$.
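A direct implementation of Equation (12) could look like the PyTorch sketch below; handling image borders with replicate padding is our own assumption, since the exact boundary behavior is not specified.

```python
import torch
import torch.nn.functional as F

def dai(img_src, img_edit, handles, targets, gamma=20):
    """Drag Accuracy Index of Eq. (12). Images are (3, H, W) tensors; handle/target
    points are (x, y) integer pairs; gamma is the patch radius (20 in our evaluation)."""
    def patch(img, point):
        x, y = point
        padded = F.pad(img[None], (gamma, gamma, gamma, gamma), mode="replicate")[0]
        return padded[:, y:y + 2 * gamma + 1, x:x + 2 * gamma + 1]   # (3, 2g+1, 2g+1)
    total = 0.0
    for h, g in zip(handles, targets):
        diff = patch(img_src, h) - patch(img_edit, g)
        total += (diff ** 2).sum().item() / (1 + 2 * gamma) ** 2     # normalized squared error
    return total / len(handles)
```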
To better assess region-level DBIE, LPIPS is computed between original images and edited results over three parts: (i) LPIPS_ue for uneditable regions, (ii) LPIPS_th for the consistency between handle and target regions, and (iii) LPIPS_hh for handle regions, which are shown in Equations (13), (14), and (15), respectively:
$$\mathrm{LPIPS}_{ue} = \frac{1}{m}\sum_{j=1}^{m} \mathrm{LPIPS}\big(x_j[M],\ \bar{x}_j[M]\big) \quad (13)$$
$$\mathrm{LPIPS}_{th} = \frac{1}{mn}\sum_{j=1}^{m}\sum_{i=1}^{n} \mathrm{LPIPS}\big(x_j[\vartheta_i],\ \bar{x}_j[\rho_i]\big) \quad (14)$$
$$\mathrm{LPIPS}_{hh} = \frac{1}{mn}\sum_{j=1}^{m}\sum_{i=1}^{n} \mathrm{LPIPS}\big(x_j[\vartheta_i],\ \bar{x}_j[\vartheta_i]\big). \quad (15)$$
In the above equations, LPIPS(·) measures LPIPS values
between input images, and [·] selects regions where given
binary masks have a value of 1. xj and x̄j represent a pair of
an original image and an edited result, {ϑi}i=1,...,n represent
handle regions given by users, and {ρi}i=1,...,n denote target
regions in edited results. Target regions can be estimated by
considering drag instructions given by users, as mentioned in
Definition 2 and Equation (5) of the paper’s main body.
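For illustration, a masked-LPIPS helper built on the publicly available lpips package could be sketched as follows. Zeroing out pixels outside each region is one plausible reading of the $[\cdot]$ selection operator; cropping to the region's bounding box would be an alternative, and the averaging over samples and regions of Equations (13)-(15) is performed outside the helper.

```python
import lpips          # pip install lpips
import torch

loss_fn = lpips.LPIPS(net="alex")     # expects (N, 3, H, W) images scaled to [-1, 1]

def masked_lpips(x, x_edit, mask_src, mask_edit):
    """LPIPS between region `mask_src` of x and region `mask_edit` of x_edit,
    with pixels outside each region zeroed before comparison."""
    a = x * mask_src[None].float()
    b = x_edit * mask_edit[None].float()
    with torch.no_grad():
        return loss_fn(a[None], b[None]).item()

# LPIPS_ue: masked_lpips(x, x_edit, M, M), averaged over images            (Eq. 13)
# LPIPS_th: masked_lpips(x, x_edit, theta_i, rho_i), averaged over i and j (Eq. 14)
# LPIPS_hh: masked_lpips(x, x_edit, theta_i, theta_i), averaged over i, j  (Eq. 15)
```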
What can these metrics do? According to Equation (13),
(14), (15), we summarize the functions of these metrics here:
• LPIPS_ue measures LPIPS between the uneditable regions of an original input image and an edited result. A lower LPIPS_ue indicates better preservation of uneditable regions, whereas a higher LPIPS_ue implies that uneditable regions are altered during dragging.
• LPIPS_th measures the consistency between the handle regions of an original input image and the target regions of an edited result. A lower LPIPS_th means handle regions are successfully dragged to target positions; conversely, failing to drag objects to target positions results in a higher LPIPS_th due to the mismatch between original handle regions and target areas in edited results.
• LPIPS_hh measures the divergence between the handle regions of an input image and an edited result. Successfully dragging handle regions to target positions should result in a higher LPIPS_hh, reflecting the change of visual content; otherwise, the handle regions in the original and edited images remain the same, resulting in a lower LPIPS_hh.
Why not point-based metrics? We do not employ point-
based evaluation metrics in NextBench, such as the Mean
Distance (MD) between handle and target points (Shi et al.
2024), as they are incompatible with the region-based na-
ture of our proposed Reliable DBIE. Unlike the previous
point-based DBIE setting, Reliable DBIE emphasizes region-
level consistency, rendering point-based metrics biased and
insufficient for evaluating model performance in this context.
D. Translation, Deformation, and 2D/3D
Rotation
In this work, we adopt two geometric transformations widely
used in computer graphics: translation and rotation. We
observe that these two transformation functions can cover
most DBIE scenarios, including translation, deformation, and
2D/3D rotation. Theoretically, our approach does not have
restrictions on the types of geometric transformations
and is compatible with other transformations used in
computer graphics. This paper aims to provide a new foun-
dational framework for DBIE, and we leave the exploration
of more transformations within this framework to our future
research.
Translation
Translation refers to moving an object or region from one lo-
cation to another without altering its shape or size. Suppose a
point p = (x, y) is translated along a direction d = (dx, dy);
its new coordinates p′ = (x′, y′) can be computed using
Equation (16):
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & d_x \\ 0 & 1 & d_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Translation matrix}} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (16)$$
where the first term on the right-hand side of the equation is
commonly referred to as the translation matrix. Region-level
translation is achieved by applying this translation to every
point within the region. According to the handle point $h_i$ and the target point $g_i$, we can calculate the translation offset $d_i$ of the region $\vartheta_i$ as $d_i = g_i - h_i = (x_i^g - x_i^h,\ y_i^g - y_i^h)$.
Deformation
Deformation refers to a non-rigid transformation that alters
the shape of an object. As shown in Figure 12 (a), previous DBIE methods achieve object deformation by dragging handle points to corresponding target positions. In contrast, our DragNeXt extends this paradigm from point-based to region-oriented deformation by reformulating object deformation as the translation of a local part of an object, e.g., as shown in Figure 12 (b), dragging the edge region of the lamp to realize its deformation. In the following, we provide more discussion and analysis of our region-oriented deformation.

Figure 11: A brief illustration of samples from our NextBench. (a) Examples from NextBench: each sample provides an image, a handle region, an editable region, and the actions recorded on our web interface (center points, handle points, target points, the transformation type, and the user intention, e.g., "rotate the sculpture." or "please move the potted plant to the left."). (b) Data collection through our web-based system.

Figure 12: Point-Based VS. Region-Oriented Deformation. (a) Point-based deformation: Result-1 or Result-2? (b) Region-oriented deformation: Result-1! Upgrading point-based deformation to region-oriented deformation.

Figure 13: Object Translation VS. Deformation. Translation is moving the entire region of an object, whereas deformation can be seen as translating an object's subregion.
Figure 14: Visualization of gradients back-propagated to the latent code when dragging an object or its local area. (a) Translation-1, (b) Deformation-1, (c) Translation-2, (d) Deformation-2.
Point-Based VS. Region-Oriented Deformation. Using
points alone to instruct the dragging process has a major
drawback: handle points can only offer limited contextual
information and cannot accurately specify which regions
should be dragged or deformed. For example, the drag
points in Figure 12 (a) can be interpreted as either drag-
ging the whole lamp or the lamp’s edge, introducing severe
ambiguity into the dragging process. By contrast, extend-
ing point-based instructions to region-level guidance enables
full exploitation of the information of pixels around handle
points and can more clearly specify which regions need to be
dragged. We also observe that incorporating region-level vi-
sual contexts leads to higher-quality deformation results, e.g.,
Result-1 in Figure 12 (b) is obviously better than Result-1 in
Figure 12 (a). These observations validate the importance of
extending point-based deformation to region-based deforma-
tion!
Region-Oriented Deformation VS. Translation. As can be
seen from Figure 13, object translation can be regarded as
moving the entire region of an object, whereas deformation
can be seen as moving an object's subregion. In this work,
we extend the previous point-based deformation to region-
oriented deformation, enabling more reliable DBIE by lever-
aging region-level visual context. However, similar to prior
methods (Zhang et al. 2024; Jiang, Wang, and Chen 2025;
Mou et al. 2023; Lu, Li, and Han 2024; Zhao et al. 2024),
this approach still cannot achieve physics-driven deformation
results. Currently, achieving physics-driven DBIE results re-
mains highly challenging. We leave this to future work, in which we plan to incorporate physics-based transformation functions into our editing framework.
REMARK 8. In Figure 14, we provide a visualization of
gradients back-propagated to latent code when dragging an
object or its subregion. In Figure 14 (a) and (b), the object
and its local region are dragged over a short distance. The
gradients are primarily localized in the regions that require
manipulation, whereas the areas that do not need adjustment
remain unaffected. In Figure 14 (c) and (d), the desk lamp
and its local area are dragged over a longer distance. We
observe that, regardless of the dragging distance, regions
with distinct appearances remain separated from the target
regions. For instance, the black desk lamp does not disturb
the appearance of the white background in Figure 14 (c); in
Figure 14 (d), the background does not affect the extended
region of the desk lamp caused by the dragging operation.
This phenomenon can be attributed to strong prior knowledge
and patterns learned by pretrained diffusion models from vast
amounts of training data.
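The locality described in this remark can be sanity-checked in a few lines of PyTorch. The following toy sketch is our own illustration rather than the paper's implementation: a small random convolution stands in for the feature extractor F, a two-pixel shift stands in for the drag mapping, and the regularization term is omitted. It confirms that the gradient of a region-masked L1 loss reaches the latent only inside the masked area, plus the receptive field of the stand-in features.

```python
import torch

# Toy check of gradient locality: the loss only compares features inside a
# drag-region mask, so the gradient flowing back to the latent is confined to
# that region (expanded by the 1-pixel halo of the 3x3 stand-in convolution).
torch.manual_seed(0)
feat = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)   # stand-in for UNet features

z = torch.randn(1, 4, 32, 32, requires_grad=True)        # latent code
ref = feat(z).detach()                                    # reference features (fixed)
rho = torch.zeros(1, 1, 32, 32)
rho[..., 10:16, 10:16] = 1.0                              # drag-region mask

shifted_ref = torch.roll(ref, shifts=2, dims=-1)          # features "dragged" 2 px right
loss = ((feat(z) - shifted_ref) * rho).abs().sum()        # region-masked L1 loss
loss.backward()

grad_mag = z.grad.abs().sum(dim=1)                        # per-pixel gradient magnitude
nonzero = (grad_mag > 0).nonzero()[:, 1:]                 # (row, col) of nonzero gradients
print(nonzero.min(dim=0).values, nonzero.max(dim=0).values)  # mask extent 10..15 plus halo
```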
2D/3D Rotation
Rotation refers to the process of rotating an object or region
around a specified point by a certain angle. The rotation
operation can be categorized into two types, 2D rotation and
3D rotation, which are introduced one by one below.
2D Rotation. Suppose we aim to rotate a region around a center point $c = (x_c, y_c)$ by an angle $\theta$. For each point $p = (x, y)$ in this region, the updated coordinates $p' = (x', y')$ can be computed using Equation (17):
\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} 1 & 0 & x_c \\ 0 & 1 & y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Back to } c}
\underbrace{\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Rotation matrix}}
\underbrace{\begin{bmatrix} 1 & 0 & -x_c \\ 0 & 1 & -y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{To the origin}}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix},
\tag{17}
\]
where the middle term on the right-hand side of the equation
is commonly referred to as the rotation matrix, while the
remaining matrices are used to translate regions either to the
origin or back to the center point $c$. According to the handle point $h_i = (x_i^h, y_i^h)$, target point $g_i = (x_i^g, y_i^g)$, and center point $c_i = (x_i^c, y_i^c)$ given by users, we can calculate the rotation matrix as follows:
\[
\cos\theta = \frac{(x_i^h - x_i^c)(x_i^g - x_i^c) + (y_i^h - y_i^c)(y_i^g - y_i^c)}
{\sqrt{(x_i^h - x_i^c)^2 + (y_i^h - y_i^c)^2}\,\sqrt{(x_i^g - x_i^c)^2 + (y_i^g - y_i^c)^2}},
\tag{18}
\]
\[
\sin\theta = \frac{(x_i^h - x_i^c)(y_i^g - y_i^c) - (y_i^h - y_i^c)(x_i^g - x_i^c)}
{\sqrt{(x_i^h - x_i^c)^2 + (y_i^h - y_i^c)^2}\,\sqrt{(x_i^g - x_i^c)^2 + (y_i^g - y_i^c)^2}}.
\tag{19}
\]
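To make Equations (17), (18), and (19) concrete, here is a small NumPy sketch written by us for illustration; function names such as rotation_from_drag are hypothetical and not part of any released code. It recovers $(\cos\theta, \sin\theta)$ from a handle point, its target position, and the user-chosen center, builds the homogeneous transform of Equation (17), and applies it to region coordinates.

```python
import numpy as np

def rotation_from_drag(handle, target, center):
    """Recover (cos theta, sin theta) of Eqs. (18)-(19) from a handle point,
    its target position, and the user-chosen rotation center."""
    h = np.asarray(handle, dtype=float) - np.asarray(center, dtype=float)
    g = np.asarray(target, dtype=float) - np.asarray(center, dtype=float)
    norm = np.linalg.norm(h) * np.linalg.norm(g)          # assumes handle/target != center
    cos_t = (h[0] * g[0] + h[1] * g[1]) / norm             # dot product -> cos theta
    sin_t = (h[0] * g[1] - h[1] * g[0]) / norm             # 2D cross product -> sin theta
    return cos_t, sin_t

def rotate_region(points, center, cos_t, sin_t):
    """Apply the homogeneous transform of Eq. (17): translate the region to the
    origin, rotate by theta, then translate back to the center c."""
    xc, yc = center
    T_back = np.array([[1, 0, xc], [0, 1, yc], [0, 0, 1]], dtype=float)
    R      = np.array([[cos_t, -sin_t, 0], [sin_t, cos_t, 0], [0, 0, 1]], dtype=float)
    T_to_o = np.array([[1, 0, -xc], [0, 1, -yc], [0, 0, 1]], dtype=float)
    M = T_back @ R @ T_to_o
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # (N, 3) homogeneous coords
    return (pts_h @ M.T)[:, :2]

# Toy usage: dragging handle (1, 0) to target (0, 1) around center (0, 0)
# corresponds to a 90-degree counter-clockwise rotation of the region.
cos_t, sin_t = rotation_from_drag(handle=(1, 0), target=(0, 1), center=(0, 0))
new_pts = rotate_region(np.array([[1.0, 0.0], [2.0, 0.0]]), (0, 0), cos_t, sin_t)
```

Note that this sketch assumes the usual mathematical convention with the y-axis pointing up; on image grids where y points down, the same formulas simply flip the apparent rotation direction.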
3D Rotation. We compare 2D and 3D rotations in Figure 15.
The 2D rotation operation can be explicitly modeled by using
a geometric transformation function. However, there is no predefined mapping function capable of handling 3D rotation well, since it involves a complex non-rigid transformation
Table 2: Voting results of our anonymous user study. DragNeXt: 84%; ClipDrag: 3%; RegionDrag: 4%; FastDrag: 5%; none of the results is satisfying: 4%.
that inherently depends on strong priors—such as object
appearance, shape, and structural consistency—during the
dragging process. For example, as shown in the last column of
Figure 15, the warpage function, used in FastDrag (Zhao et al.
2024), easily leads to unnatural and unrealistic deterioration
of objects.
Observation 1. We have observed an interesting phe-
nomenon in our experiments: despite 3D rotation being in-
herently more challenging than 2D rotation, current DBIE
models surprisingly perform better on 3D rotation than on
2D rotation!
As shown in the third row of Figure 15, most DBIE methods
are able to rotate the 3D angle of the sculpture by directly
dragging the face leftward; by contrast, none of the
methods can achieve 2D rotation of the face of the sculpture
or the raccoon, as exhibited in the first and second rows.
We attribute this counterintuitive phenomenon to two factors: (i) the strong capability of current DBIE methods in 3D rotation actually arises from the strong prior of pretrained diffusion models, which can creatively generate rotated objects based on latent features perturbed by drag operations; (ii) although 2D rotation can be explicitly modeled by geometric transformation functions, the 2D rotation pattern is not well captured and learned by diffusion models during pretraining.
Point-Based Rotation VS. Region-Based Rotation. The
concept of 3D rotation in DBIE was formally introduced for
the first time in GoodDrag (Zhang et al. 2024). As shown
in Figure 16 (a), GoodDrag uses handle and target points to
instruct the dragging process of 3D rotation, i.e., dragging
handle points to the positions of target points. As shown in
Figure 16 (b), we also extend the point-based 3D rotation to
the region-based 3D rotation. Extending point-based instruc-
tions to region-level guidance enables full exploitation of
the information of pixels around handle points and can more
clearly specify which regions need to be dragged, thereby
helping models to achieve better 3D rotation effects.
REMARK 9. (i) Relationship between point-based 3D rotation and translation. We argue that there is no essential difference between point-based 3D rotation and translation, both of which aim to drag handle points to the positions of target points. However, translating the handle points of an object may lead
[Figure 15 grid: rows correspond to 2D rotation and 3D rotation; columns compare Image & Drag with DragNeXt, DragDiffusion, DiffEditor, FreeDrag, DragNoise, GoodDrag, ClipDrag, RegionDrag, and FastDrag.]
Figure 15: Illustrations of 2D Rotation and 3D Rotation.
(a) Point-based 3D rotation (b) Region-based 3D rotation
Figure 16: Point-Based VS. Region-Based 3D Rotation.
Point-based 3D rotation is introduced by GoodDrag (Zhang
et al. 2024), and region-based 3D rotation is used in our
DragNeXt. Panel (a) is borrowed from GoodDrag.
diffusion models to draw 3D rotation effects, e.g., the car's 3D rotation can be realized by translating its front to the right, as shown in the instruction in Figure 16. (ii) Relationship between region-based 3D rotation and translation. Similar to the point-based rotation used in GoodDrag, region-based 3D rotation can be seen as translating an object's subregion.
E. More Experimental Results
Drag-Based Image Editing at Relatively Long
Distances
Dragging objects at a relatively long distance remains a major
challenge in the current field of DBIE. Most of the existing
methods only support dragging objects or regions over a short
distance and are incapable of handling long-distance drag-
based editing tasks. Although our method is not specifically
designed for long-distance DBIE, we are surprised that it
still exhibits superior performance compared to the recent
counterparts. As exemplified in Figure 17, we successfully
drag the desk lamp, the stone, and the person’s hand over a
relative long distance while maintaining high editing qual-
ity. In contrast, the compared methods either fail to achieve
long-distance dragging or to yield satisfactory quality. For
instance, FastDrag easily causes unnatural deformation of
objects, while RegionDrag is prone to resulting in artifacts
in edited regions, as we mentioned in the main body of the
paper. Also, ClipDrag, DiffEditor, and GoodDrag suffer from
severe loss of regional details during long-distance dragging.
Why is DragNeXt superior? We believe that the superiority
of our method in long-distance dragging tasks lies in two
aspects: (i) DragNeXt fully exploits region-level contextual
information, effectively alleviating the influence of losing
sparse handle points during long-distance dragging; and (ii)
it incorporates progressive guidance from intermediate drag
states, enabling a smoother and more stable dragging process.
The strength in handling long-distance dragging tasks re-
veals that our method has great potential for achieving DBIE
in complex scenarios. We plan to explore this point in our
subsequent work.
More Visualized Results
We provide more visualized results obtained by DragNeXt in
Figures 18, 19, and 20 as a supplement to the experiments
provided in the main body of the paper. These results lead to
conclusions consistent with those discussed in Section , fur-
ther confirming the effectiveness of our approach in aligning
with user intentions. For instance, in Figure 18(d), our method
successfully moves the potted plant to the left, whereas all
compared methods fail: DragDiffusion does not alter the
plant’s position, and DiffEditor, GoodDrag, ClipDrag, Re-
gionDrag, and FastDrag produce noticeable unnatural defor-
mations. Moreover, our method also demonstrates superior
performance in 2D rotation tasks, yielding more natural and
visually consistent results than the compared counterparts,
nearly all of which fail to generate satisfactory outcomes.
We have also summarized our results in videos to provide a
clearer and more intuitive illustration of the dragging effects
yielded by our DragNeXt; for details, please kindly refer to
[Figure 17 columns: Image & Drag, Our DragNeXt, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag.]
Figure 17: Experimental results on Drag-Based Image Editing at Relatively Long Distances.
the uploaded files in the supplementary material.
F. Anonymous User Study
Since quantitative evaluation metrics may not fully demon-
strate the effectiveness of our method in addressing the ambi-
guity issue and expectation-result misalignment, we addition-
ally provide an anonymous user study in which a total of 26 participants are invited. The details of the questionnaire are summarized in Figures 21, 22, 23, and 24. The questionnaire consists of 15 questions in total, where 12 items are closely related to the ambiguity issues mentioned in Proposition 1 and 3 items are used to assess the quality of edited images. Also, for each question, five candidate options are
provided:
• the options A–D correspond to randomly ordered re-
sults generated by DragNeXt, ClipDrag, RegionDrag,
and FastDrag;
• the option E indicates that none of the results are satisfac-
tory.
The reason for limiting the options to A–E is to reduce the
participants’ workload, as selecting the most suitable result
from too many options would be time-consuming and not
user-friendly. In Figure 2, we provide the anonymous voting
results from the invited participants. As can be seen from
the figure, the voting results demonstrate the effectiveness of
method again, ., the average results from the participants
indicate that 84% of our edited images are better than those
of the compared models.
Why are ClipDrag, RegionDrag, and FastDrag Chosen
as Compared Models? These three methods are the most
relevant to our research. ClipDrag (Jiang, Wang, and Chen
2025) addresses ambiguity in DBIE by incorporating tex-
tual guidance; in contrast, we reformulate DBIE as a Latent
Region Optimization (LRO) problem to alleviate ambigu-
ity while further improving the efficiency of the alternating
workflow. DragNeXt advances RegionDrag and FastDrag by
transforming forward, optimization-free manipulation into
backward, self-interventional latent optimization, thereby
fully leveraging the prior of pretrained diffusion models to
avoid unrealistic and unnatural deterioration.
G. Limitations and Future Work
We summarize the pseudocode of our method in Algorithm 1.
Here, we elaborate on the limitations of our current work,
and introduce the corresponding plan for our future research.
• Limited types of used geometric transformation func-
tions. In our current work, we adopt two geometric
transformations, namely translation and 2D rotation. Although these two transformations can realize translation, deformation, and 2D/3D rotation effects in DBIE, there may still exist other useful transformation functions that have not yet been considered, such as scaling and shearing. In the future, we will explore more types of geometric transformation functions. The main challenge of incorporating more geo-
metric transformations lies in how to unify them into the
current format of drag instructions. For example, dragging
may lead not only to regular scaling but also to irregular or
non-uniform scaling effects; however, properly defining
scaling factors along each direction remains a non-trivial
problem.
• Physically-driven drag effects. Achieving physics-
driven editing results remains a key challenging problem
in DBIE. In this work, we extend point-guided dragging
Algorithm 1: Pseudocode of our proposed method.
Input: an input image $x$, user-specified handle regions $E = \{\vartheta_i\}_{i=1,\dots,n}$ and drag instructions $C = \{T_i, O_i\}_{i=1,\dots,n}$, hyperparameters $T$, $T'$, and $K$;
$z_0 = \mathrm{VAE\_Encoder}(x)$, $\{z_1, \dots, z_T\} = \mathrm{Inversion}(z_0)$;  // Encoding and inversion.
// The denoising phase begins.
for $t = T$ to $0$ do
    if $T' < t < T$ then
        $z_t^0 \leftarrow z_t$;
        // Performing the PBSI strategy.
        for $k = 0$ to $K - 1$ do
            $\{\rho_i^{t,k}, \Pi_{\vartheta_i \to \rho_i^{t,k}}\}_{i=1,\dots,n} = \bigcup_{i=1,\dots,n} \delta(\vartheta_i, T_i, O_i, t, k)$;
            $L_{\mathrm{LRO}} = \sum_{i=1,\dots,n} \big\| F(z_t^k) * \rho_i^{t,k} - F(z_t).\mathrm{detach}()[\Pi_{\vartheta_i \to \rho_i^{t,k}}] * \rho_i^{t,k} \big\|_1 + R_M$;
            $z_t^{k+1} \leftarrow z_t^k - \frac{\partial L_{\mathrm{LRO}}}{\partial z_t^k}$;
        end
        $z_{t-1} = z_t^{K-1} - \varepsilon_\Theta(z_t^{K-1}; t, c)$;
    else
        $z_{t-1} = z_t - \varepsilon_\Theta(z_t; t, c)$;  // Vanilla denoising.
    end
end
$\bar{x} = \mathrm{VAE\_Decoder}(z_0)$;  // Decoding latent embeddings.
Output: an edited image $\bar{x}$;
to a region-oriented geometric transformation paradigm
to alleviate ambiguity and better leverage contextual infor-
mation. In future work, we aim to realize physics-driven
editing by incorporating physically based transformation
functions.
• Long-distance DBIE. Our DragNeXt is not specifically
designed for long-distance DBIE, despite achieving su-
perior results in this challenging setting. In future work,
we plan to conduct more experiments to identify the key
factors for achieving high-quality results in long-distance
DBIE and to further enhance our DragNeXt.
• DBIE with Linear Attention. Diffusion models usually
adopt standard self-attention with quadratic complexity,
resulting in high computational costs. Recent advances in
linear attention (Zhu et al. 2025; Zhou et al. 2025b; Liao
et al. 2025) offer an effective way to alleviate this compu-
tational bottleneck. In future work, we plan to investigate
incorporating linear attention into DBIE to further accel-
erate the editing process.
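As a complement to Algorithm 1, the snippet below gives a minimal PyTorch-style sketch of one inner PBSI update at a single timestep. It is our own simplified illustration rather than the released implementation: a tiny convolution stands in for the UNet feature extractor F, a pixel shift via torch.roll stands in for the mapping Π produced by δ, an explicit step size is introduced, and the regularization term R_M is omitted.

```python
import torch

# Stand-in feature extractor; in the paper F(.) denotes intermediate UNet
# features, which we replace with a fixed random projection for this sketch.
torch.manual_seed(0)
proj = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1)

def features(z):
    return proj(z)

def pbsi_step(z_t, ref_feat, masks, warps, step_size=0.1):
    """One inner PBSI update (inner loop of Algorithm 1): pull the features
    inside each intermediate target region toward the detached reference
    features warped from the handle region, then step the latent z_t."""
    z_t = z_t.detach().requires_grad_(True)
    feat = features(z_t)                                   # F(z_t^k)
    loss = 0.0
    for rho, warp in zip(masks, warps):
        target = warp(ref_feat)                            # reference features under Pi_i
        loss = loss + ((feat - target) * rho).abs().sum()  # masked L1 term of L_LRO
    grad = torch.autograd.grad(loss, z_t)[0]               # dL_LRO / dz_t^k
    return z_t - step_size * grad                          # z_t^{k+1}

# Toy usage: drag a 4x4 patch of a 16x16 latent two pixels to the right.
z = torch.randn(1, 4, 16, 16)
ref = features(z).detach()                                 # fixed reference F(z_t)
rho = torch.zeros(1, 1, 16, 16)
rho[..., 6:10, 8:12] = 1.0                                 # intermediate target region
warp = lambda f: torch.roll(f, shifts=2, dims=-1)          # Pi: shift right by 2 px
z_next = pbsi_step(z, ref, [rho], [warp])
```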
[Figure 18 grid: columns Image & Drag, Our DragNeXt, DragDiffusion, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag; numbers give per-example latency. User intentions (a)–(m): (a) move the cup to the left without changing its shape; (b) move down the whole handle of the coffee cup; (c) move the coffee cup to the top corner; (d) move the potted plant to the left; (e) drag the small cactus upward to make it taller; (f) pull the top of the cactus downward, shortening its height; (g) move the head of the woman to the left; (h) move the woman's eyes upward; (i) move the hand holding a metallic-copper handbell; (j) move the face of the man to the left; (k) move the head of the man to the left; (l) lower the height of the desk lamp; (m) make the size of the desk lamp bigger.]
Figure 18: More experimental results—part I.
[Figure 19 grid: columns Image & Drag, Our DragNeXt, DragDiffusion, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag; numbers give per-example latency. User intentions (a)–(m): (a) rotate the cat's head around its left cheek as the central point; (b) rotate the cat's head to the left around its chin as the central point; (c) move the head of the cat to the left; (d) pull the top of the hat upward; (e) press the top of the … downward; (f) move the hat downward without changing its shape; (g) drag the root of the tree and widen the road; (h) make the tree hole smaller; (i) drag the tree hole to the right; (j) pull the bear's head up and rotate its hand; (k) move the bear to the right; (l) drag the sun upward; (m) expand the visible horizon line.]
Figure 19: More experimental results—part II.
[Figure 20 grid: columns Image & Drag, Our DragNeXt, DragDiffusion, DiffEditor, GoodDrag, ClipDrag, RegionDrag, FastDrag; numbers give per-example latency. User intentions (a)–(m): (a) rotate the sculpture to the left around its nose; (b) move the sculpture to the left without changing its pose; (c) move the eyes, nose, and mouth of the sculpture to the right; (d) rotate the head of the boy to the left around his mouth; (e) move the head of the boy to the right corner; (f) rotate the head of the wolf to the right; (g) expand the borders of the artwork; (h) drag the yellow flowers upward; (i) move the woman's head to the left on the canvas; (j) rotate the head of the alpaca; (k) rotate the head of the alpaca; (l) rotate the raccoon's head around its chin as the central point; (m) move the giraffe's head and extend its outline.]
Figure 20: More experimental results—part III.
Anonymous User Study on Drag-Based Image Editing
*Requirement: Please choose the result that best follows the editing instruction and exhibits the highest editing quality.
(1) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Rotate the sculpture to the left around its nose.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(2) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Rotate the head of the boy to the left around his mouth.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(3) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the woman's head to the left on the canvas.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(4) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the sculpture to the left and do not change its pose.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 21: Questionnaire—Part I (questions (1)∼(4)).
(5) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the head of the boy to the right corner.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(6) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the cup to the left and do not change its shape.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(7) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the potted plant to the left.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(8) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Lower the height of the desk lamp.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 22: Questionnaire—Part II (questions (5)∼(8)).
(9) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Rotate the cat's head around its left cheek as the central point.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(10) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Please move the hat down.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(11) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the head of the woman to the left.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(12) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move the hand holding a metallic-copper handbell.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 23: Questionnaire—Part III (questions (9)∼(12)).
(13) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Move down the whole handle of the coffee cup.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(14) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Pull the top of the cactus downward, shortening its height.
Candidate options: A, B, C, D; E: None of the results are satisfying!
(15) Select the result that best follows the instruction with the highest editing quality: _____
*Editing instruction: Drag the root of the tree and widen the road.
Candidate options: A, B, C, D; E: None of the results are satisfying!
Figure 24: Questionnaire—Part IV (questions (13)∼(15)).