ViTO: Vision Transformer Operator
In recent years, transformers have become the dominant trend in machine learning models. They are at the core of Large Language Models (including ChatGPT), generative art tools (Midjourney and DALL-E), and more. Transformers were first introduced for Natural Language Processing, and their strong performance prompted researchers to explore their use in other applications. One recent development is the Vision Transformer (ViT). Designed to extract context from images more effectively, ViTs have taken over the image processing domain, setting new benchmark results in almost every task. In this work, we propose using ViTs to solve inverse problems. We created an architecture that uses a U-Net structure with a ViT in the latent space. We train it to map a state of the system to either the original state (the initial condition) or the parameters of the problem (such as material properties or the right-hand side). Since a single trained network is used for many different problems, it is in effect learning an inverse operator, hence the name ViTO: Vision Transformer Operator. We test ViTO on several applications and achieve very accurate results, especially for super-resolution.
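The data flow described above (a U-Net whose latent feature map is processed by a transformer) can be illustrated with a minimal, weight-free sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, the encoder and decoder are reduced to one average-pooling step and one nearest-neighbour upsampling step, the attention uses identity projections instead of learned ones, and the skip connection is an addition where a U-Net would usually concatenate.

```python
import math


def downsample(fmap):
    """Encoder stand-in: 2x2 average pooling on a [C][H][W] nested list (H, W even)."""
    return [[[(ch[2 * i][2 * j] + ch[2 * i][2 * j + 1]
               + ch[2 * i + 1][2 * j] + ch[2 * i + 1][2 * j + 1]) / 4.0
              for j in range(len(ch[0]) // 2)]
             for i in range(len(ch) // 2)]
            for ch in fmap]


def upsample(fmap):
    """Decoder stand-in: nearest-neighbour 2x upsampling."""
    return [[[ch[i // 2][j // 2] for j in range(2 * len(ch[0]))]
             for i in range(2 * len(ch))]
            for ch in fmap]


def to_tokens(fmap):
    """Flatten a [C][H][W] latent map into H*W tokens of dimension C."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[k][i][j] for k in range(c)]
            for i in range(h) for j in range(w)]


def to_feature_map(tokens, h, w):
    """Inverse of to_tokens: rebuild the [C][H][W] map from H*W tokens."""
    c = len(tokens[0])
    return [[[tokens[i * w + j][k] for j in range(w)] for i in range(h)]
            for k in range(c)]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections
    (an assumption; a trained ViT would use learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(wt * v[t] for wt, v in zip(weights, tokens)) / z
                    for t in range(d)])
    return out


def vito_like_forward(image):
    """Encoder -> attention over latent tokens -> decoder, plus an additive skip."""
    latent = downsample(image)
    h, w = len(latent[0]), len(latent[0][0])
    mixed = to_feature_map(self_attention(to_tokens(latent)), h, w)
    up = upsample(mixed)
    return [[[up[k][i][j] + image[k][i][j]
              for j in range(len(image[0][0]))]
             for i in range(len(image[0]))]
            for k in range(len(image))]
```

For an input of shape [C][H][W], `vito_like_forward` returns an output of the same shape, which is the property an image-to-image inverse map (observed state to initial condition, or to a parameter field on the same grid) needs. In a trained model the pooling, attention, and upsampling steps would all carry learned weights.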