I think you have it backwards kinda. we do not care about ordering whatsoever, we are keeping a "partition" of processed and unprocessed elements.
essentially we have a queue with processed and unprocessed elements and a pointer to where we should process next.
// start
[P1 P2 P3 P4 *U5 U6 U7 U8]
// process element
[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using push back and swap
[P1 P2 P3 P4 P5 *U6 U7 U8 P9]
[P1 P2 P3 P4 P5 *P9 U7 U8 U6]
[P1 P2 P3 P4 P5 P9 *U7 U8 U6]
// insert unprocessed element
[P1 P2 P3 P4 P5 P9 *U7 U8 U6 U10]
If we were to try to do the same with inserts, it would cause N^2 behavior. As you can see, we also do not preserve order during the swap back approach.
If we were to do insert it would look like:
[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using insert
[P1 P2 P3 P4 P5 *P9 U6 U7 U8 P9]
[P1 P2 P3 P4 P5 P9 *U6 U7 U8 P9]
And the shifting would cost O(N).
Deque is a good data structure, but has an awful lot of extra overhead making it a poor fit for a performance critical code section. One of the key benefits of this approach is that we keep our data structure quick to iterate/add to.
The swap approach is O(1) per element added, which is fine. If we used insert / shifting it would be O(N) per element, which would cause a quadratic blowup.