Given that:

*A GPU contains multiple SIMD processors.

*Each SIMD processor contains multiple lanes.

*Each SIMD processor is assigned a single thread block (by the thread block scheduler).

The question is which one of these two alternatives is correct:

-Alt1 (parallel execution of threads): Each lane runs a single thread from the thread block -> to execute completely, each thread takes as many clock cycles as there are elements in the vector that it reads from or writes to.

-Alt2 ("sequential-alternating" execution of threads): Each thread occupies all lanes of a single SIMD processor -> each thread takes round_up(<nr_of_elements_in_the_vector>/<nr_of_lanes_per_SIMD_processor>) clock cycles to finish execution (not necessarily consecutive) -> the thread scheduler (in each SIMD processor) alternates between different threads, even when a thread has not yet finished all its cycles. So the threads do not execute in parallel.
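
To make the two cycle counts concrete, here is a minimal Python sketch of how I read each alternative. The function names and the example sizes (32 elements, 16 lanes) are made up for illustration, and the sketch ignores memory latency and any scheduling overhead:

```python
def cycles_alt1(n_elements: int) -> int:
    """Alt1: a thread sits on one lane and handles one element per clock cycle."""
    return n_elements


def cycles_alt2(n_elements: int, n_lanes: int) -> int:
    """Alt2: a thread occupies all lanes, so it handles n_lanes elements per cycle.

    This is round_up(n_elements / n_lanes), written with integer arithmetic.
    """
    return (n_elements + n_lanes - 1) // n_lanes


# Hypothetical sizes, chosen only for illustration.
n_elements = 32
n_lanes = 16

print(cycles_alt1(n_elements))           # 32
print(cycles_alt2(n_elements, n_lanes))  # 2
```

Under Alt2 those 2 cycles need not be consecutive, since the scheduler may interleave cycles of other threads in between.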



(PS. Alt1 is what I understood from the GPU class/slides; Alt2 is what I understood from the book)