I'm not sure that sending the whole convo, with images, screenshots, and everything, each time is efficient. It slows down the computer interaction. Maybe implementing some kind of "aggregated" memory would ...
After each run, it sends the whole screenshot, which means the model needs to analyze the whole image again and again.
That is costly and not very efficient.
Normally just screen updates sh...
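
To make the "only send screen updates" idea concrete, here is a rough sketch of what I mean. It is not how the tool currently works, just an illustration. It assumes a frame is a plain list of rows of grayscale pixel values, and the names `changed_region` and `crop` are made up for this example. It finds the bounding box of the pixels that changed between two screenshots, so that only that patch (plus its coordinates) would need to be sent.

```python
from typing import List, Optional, Tuple

# Hypothetical frame format for this sketch: rows of grayscale pixel values.
Frame = List[List[int]]
# (left, top, right, bottom), with right and bottom exclusive.
Box = Tuple[int, int, int, int]


def changed_region(prev: Frame, curr: Frame) -> Optional[Box]:
    """Return the bounding box of pixels that differ, or None if nothing changed."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same dimensions")
    # Rows that contain at least one changed pixel.
    rows = [y for y, (a, b) in enumerate(zip(prev, curr)) if a != b]
    if not rows:
        return None
    # Column indices of every changed pixel inside those rows.
    cols = [
        x
        for y in rows
        for x, (p, c) in enumerate(zip(prev[y], curr[y]))
        if p != c
    ]
    return (min(cols), rows[0], max(cols) + 1, rows[-1] + 1)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the given box out of a frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


# Usage: two 8x6 frames that differ in two pixels.
prev = [[0] * 8 for _ in range(6)]
curr = [row[:] for row in prev]
curr[2][3] = 255
curr[3][5] = 255

box = changed_region(prev, curr)  # (3, 2, 6, 4)
patch = crop(curr, box)           # 3x2 patch instead of the full 8x6 frame
```

With real screenshots you would probably not loop over pixels in pure Python. Pillow's `ImageChops.difference` followed by `getbbox()` gives the same kind of bounding box much faster. Whether the model can actually work from a patch plus coordinates instead of a full frame is a separate question that this sketch does not answer.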