feat: yay more shit

Laureηt 2023-01-25 13:45:27 +01:00
parent 41547c2133
commit eb9ec0f0d4
Signed by: Laurent
SSH key fingerprint: SHA256:kZEpW8cMJ54PDeCvOhzreNr4FSh6R13CMGH/POoO8DI
4 changed files with 64 additions and 27 deletions

BIN  assets/DINO.pdf  Normal file

Binary file not shown.

Binary file not shown.


@@ -1,5 +1,5 @@
\documentclass[
11pt,
12pt,
a4paper
]{article}
@@ -42,13 +42,13 @@
\section{Introduction}
The field of 3D reconstruction techniques in photography, such as Reflectance Transformation Imaging (RTI)~\cite{giachetti2018} and Photometric Stereo~\cite{durou2020}, often require a precise understanding of the lighting conditions in the scene being captured. One common method for calibrating the lighting is to include one or more spheres in the scene, as shown in the left example of Figure~\ref{fig:intro}. However, manually outlining these spheres can be tedious and time-consuming, especially in the field of visual effects where the presence of chrome spheres is prevalent~\cite{jahirul_grey_2021}. This task can be made more efficient by using deep learning methods for detection. The goal of this project is to develop a neural network that can accurately detect both matte and shiny spheres in a scene.
3D reconstruction techniques in photography, such as Reflectance Transformation Imaging (RTI)~\cite{giachetti2018} and Photometric Stereo~\cite{durou2020}, often require a precise understanding of the lighting conditions in the scene being captured. One common method for calibrating the lighting is to include one or more spheres in the scene, as shown in the left example of Figure~\ref{fig:intro}. However, manually outlining these spheres is tedious and time-consuming, especially in the field of visual effects, where the use of chrome spheres is prevalent~\cite{jahirul_grey_2021}. This task can be made more efficient by using deep learning methods for detection. The goal of this project is to develop a neural network that can accurately detect both matte and shiny spheres in a scene, which could then be integrated into standard pipelines such as AliceVision Meshroom~\cite{alicevision2021}.
\begin{figure}[ht]
\centering
\begin{tabular}{cc}
\includegraphics[height=0.35\linewidth]{matte.jpg} &
\includegraphics[height=0.35\linewidth]{shiny.jpg}
\includegraphics[height=0.3\linewidth]{matte.jpg} &
\includegraphics[height=0.3\linewidth]{shiny.jpg}
\end{tabular}
\caption{Left: a scene with matte spheres. Right: a scene with a shiny sphere.}
\label{fig:intro}
@@ -61,8 +61,8 @@ Previous work by Laurent Fainsin et al. in~\cite{spheredetect} attempted to addr
\begin{figure}[ht]
\centering
\begin{tabular}{cc}
\includegraphics[height=0.35\linewidth]{matte_inference.png} &
\includegraphics[height=0.35\linewidth]{shiny_inference.png}
\includegraphics[height=0.3\linewidth]{matte_inference.png} &
\includegraphics[height=0.3\linewidth]{shiny_inference.png}
\end{tabular}
\caption{Mask R-CNN~\cite{MaskRCNN} inferences from~\cite{spheredetect} on Figure~\ref{fig:intro}.}
\label{fig:previouswork}
@@ -72,7 +72,7 @@ The automatic detection (or segmentation) of spheres in scenes is a rather niche
\section{Datasets}
In~\cite{spheredetect}, it is explained that clean photographs with spherical markers for use in 3D reconstruction techniques are unsurprisingly rare. To address this issue, the authors of the paper crafted a training dataset using python and blender scripts. This was done by compositing known spherical markers (real or synthetic) onto background images from the COCO dataset~\cite{COCO}. The result of such technique is visible in Figure~\ref{fig:spheredetect_dataset}.
In~\cite{spheredetect}, it is explained that clean photographs containing spherical markers for use in 3D reconstruction techniques are unsurprisingly rare. To address this issue, the authors created a training dataset using custom Python and Blender scripts, compositing known spherical markers (real or synthetic) onto background images from the COCO dataset~\cite{COCO}. The resulting dataset can be seen in Figure~\ref{fig:spheredetect_dataset}.
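A minimal sketch of such a compositing step could look as follows (assuming a pre-rendered sphere crop with an alpha channel; all file paths are placeholders, and the actual scripts of~\cite{spheredetect} may differ):
\begin{verbatim}
# Sketch: composite a pre-rendered sphere (RGBA) onto a COCO background
# and derive the corresponding binary mask. Paths are placeholders.
import numpy as np
from PIL import Image

background = Image.open("coco/000000000139.jpg").convert("RGB")
sphere = Image.open("renders/sphere_rgba.png").convert("RGBA")

# Random scale and position for the marker.
rng = np.random.default_rng(0)
size = int(min(background.size) * rng.uniform(0.1, 0.3))
sphere = sphere.resize((size, size))
x = int(rng.integers(0, background.width - size))
y = int(rng.integers(0, background.height - size))

# Paste using the alpha channel, then build the segmentation mask.
background.paste(sphere, (x, y), mask=sphere)
mask = np.zeros((background.height, background.width), dtype=np.uint8)
mask[y:y + size, x:x + size] = np.array(sphere)[..., 3] > 0

background.save("dataset/images/000001.jpg")
Image.fromarray(mask * 255).save("dataset/masks/000001.png")
\end{verbatim}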
\begin{figure}[ht]
\centering
@@ -84,32 +84,43 @@ In~\cite{spheredetect}, it is explained that clean photographs with spherical ma
\label{fig:spheredetect_dataset}
\end{figure}
in the same way one you could generate synthetic images of chrome spheres using free (C0) env map from
\cite{haven_hdris_nodate}
Additionally, synthetic images of chrome spheres can be generated using free (CC0 1.0 Universal Public Domain Dedication) environment maps from~\cite{haven_hdris_nodate}. These environment maps provide a wide range of realistic lighting conditions and can be used to simulate different scenarios, such as times of day, weather conditions, or indoor lighting setups. This helps to further increase the diversity of the dataset and to make the model more robust to different lighting conditions, which is crucial for the task of detecting chrome sphere markers.
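As an illustration, a chrome sphere lit by one of these environment maps can be rendered with a short Blender script along the following lines (a rough sketch: the HDRI path is a placeholder and the node setup may need adjusting for a specific Blender version):
\begin{verbatim}
# Sketch: render a mirror sphere under a Poly Haven HDRI with Blender (bpy).
import bpy

bpy.context.scene.render.engine = "CYCLES"

bpy.ops.mesh.primitive_uv_sphere_add(radius=0.1, location=(0.0, 0.0, 0.0))
sphere = bpy.context.active_object
chrome = bpy.data.materials.new("Chrome")
chrome.use_nodes = True
bsdf = chrome.node_tree.nodes["Principled BSDF"]
bsdf.inputs["Metallic"].default_value = 1.0
bsdf.inputs["Roughness"].default_value = 0.0
sphere.data.materials.append(chrome)

world = bpy.context.scene.world
world.use_nodes = True
env = world.node_tree.nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("/path/to/hdri.exr")  # placeholder path
world.node_tree.links.new(env.outputs["Color"],
                          world.node_tree.nodes["Background"].inputs["Color"])

bpy.context.scene.render.filepath = "/tmp/chrome_sphere.png"
bpy.ops.render.render(write_still=True)
\end{verbatim}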
\subsection{Antoine Laurent}
Antoine Laurent, a PhD candidate at INP of Toulouse, works on 3D reconstruction techniques in photography with the REVA team (IRIT) and on the preservation of archaeological sites with the TRACES PSH team. He is also an active member of the scientific team for the Chauvet cave project, travelling around France to take high-resolution photographs of cave paintings, menhir statues, and other historical monuments.
\begin{figure}[ht]
\centering
\begin{tabular}{cc}
\includegraphics[height=0.3\linewidth]{antoine_laurent_1.jpg} &
\includegraphics[height=0.3\linewidth]{antoine_laurent_2.jpg}
\end{tabular}
\caption{Example of clean photographs with spehrical markers from Antoine Laurent.}
\caption{Examples of clean photographs with 3D spherical markers from Antoine Laurent.}
\label{fig:antoine_laurent_dataset}
\end{figure}
He has compiled a dataset of more than 400 photographs, all of which contain 3D spherical markers used to calibrate the lighting conditions and aid the 3D reconstruction of these historical sites.
\newpage
\subsection{DeepLight}
DeepLight~\cite{legendre_deeplight_2019} is a research paper from Google that presents a deep learning-based approach for estimating the lighting conditions in mixed reality (MR) scenes captured by mobile devices. The goal of this research is to enhance the realism of MR by providing accurate estimates of the lighting conditions in the real-world scene.
\begin{figure}[ht]
\centering
\includegraphics[height=0.3\linewidth]{deeplight.png}
\caption{Example the dataset from~\cite{legendre_deeplight_2019}.}
\includegraphics[height=0.4\linewidth]{deeplight.png}
\caption{Dataset acquisition technique from~\cite{legendre_deeplight_2019}.}
\label{fig:deeplight_dataset}
\end{figure}
The authors propose a deep learning-based model called DeepLight, which takes an RGB image captured by a mobile device as input and estimates the lighting conditions in the scene, including the color and direction of the light sources. The model is trained on a dataset of real-world images captured under various lighting conditions, in which the light directions are extracted from spherical markers, as shown in Figure~\ref{fig:deeplight_dataset}. The authors demonstrate that the model can estimate the lighting conditions in new, unseen images with high accuracy. This dataset could be useful for training our model to detect chrome spheres in images, as it contains a wide range of lighting conditions.
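For reference, once a chrome sphere has been located in an image, recovering a light direction is straightforward: if $\mathbf{n}$ is the unit surface normal at the specular highlight (obtained from the highlight position relative to the detected sphere center and radius) and $\mathbf{v}$ is the unit viewing direction, the law of reflection gives the incident light direction
\begin{equation}
    \mathbf{l} = 2 \, (\mathbf{n} \cdot \mathbf{v}) \, \mathbf{n} - \mathbf{v},
\end{equation}
with $\mathbf{v} \approx (0, 0, 1)^\top$ under an orthographic approximation.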
\subsection{Multi-Illumination Images in the Wild}
In the paper ``A Dataset of Multi-Illumination Images in the Wild''~\cite{murmann_dataset_2019}, the authors present a dataset of over 1000 real-world scenes, each captured under 25 different lighting conditions, together with their panoptic segmentations. This dataset is a valuable resource for various computer vision tasks such as relighting, image recognition, object detection and image segmentation. Since it covers a wide variety of lighting conditions, it can be useful for training models to detect chrome spheres in images, making them robust to different scenarios and improving their performance in real-world applications.
\begin{figure}[ht]
\centering
\begin{tabular}{cc}
@@ -120,13 +131,11 @@ in the same way one you could generate synthetic images of chrome spheres using
\label{fig:murmann_dataset}
\end{figure}
\subsection{Labelling}
\subsection{Labelling \& Versioning}
\cite{noauthor_label_nodate}
Label Studio~\cite{noauthor_label_nodate} is an open-source web-based annotation tool that allows multiple annotators to label data simultaneously and provides a user-friendly interface for creating annotation tasks. It also makes it possible to manage annotation projects, assign tasks to different annotators, and follow the progress of the annotation process. In addition, it can version the data and handle different annotation formats.
\subsection{Versionning}
\cite{noauthor_datasets_nodate}
The output of such annotation tools can be integrated with the HuggingFace Datasets~\cite{noauthor_datasets_nodate} library, which makes it possible to load, preprocess, share and version datasets, and to easily reproduce experiments. This library has built-in support for a wide range of datasets and can handle different file formats, making it easy to work with data from multiple sources. By combining these tools, one obtains a pipeline for annotating, versioning, and sharing datasets, which can improve reproducibility and collaboration in computer vision research and development.
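As a sketch of this last step (the repository name and the exported field names are hypothetical), the annotated images could be wrapped into a dataset and pushed to the Hub as a new version:
\begin{verbatim}
# Sketch: build a HuggingFace dataset from exported annotations and push it
# to the Hub. Repository name and JSON field names are hypothetical.
import json
from datasets import Dataset, Features, Image, Value

with open("export/annotations.json") as f:
    export = json.load(f)

records = {
    "image": [item["image_path"] for item in export],
    "mask": [item["mask_path"] for item in export],
    "category": [item["category"] for item in export],
}

features = Features({"image": Image(), "mask": Image(),
                     "category": Value("string")})

dataset = Dataset.from_dict(records, features=features)
dataset.push_to_hub("username/sphere-markers", private=True)
\end{verbatim}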
\section{Models}
@@ -134,15 +143,15 @@ in the same way one you could generate synthetic images of chrome spheres using
In~\cite{spheredetect}, the authors use Mask R-CNN~\cite{MaskRCNN} as a base model for their task. Mask R-CNN is a neural network that is able to perform instance segmentation, which is the task of detecting and segmenting objects in an image.
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\linewidth]{MaskRCNN.png}
\includegraphics[height=0.3\linewidth]{MaskRCNN.png}
\caption{The Mask-RCNN~\cite{MaskRCNN} architecture.}
\label{fig:maskrcnn}
\end{figure}
The network is composed of two parts: a backbone and a region proposal network (RPN). The backbone is a convolutional neural network used to extract features from the input image. The RPN is a fully convolutional network that generates region proposals, i.e. candidate bounding boxes around potential objects. The features inside each proposal are then cropped and passed to dedicated heads that classify the proposal, refine its bounding box, and predict a segmentation mask for the object it contains.
The network is trained using a loss function that is composed of three terms: the classification loss, the bounding box regression loss, and the mask loss. The classification loss is used to train the network to classify each region proposal as either a sphere or not a sphere. The bounding box regression loss is used to train the network to regress the bounding box of each region proposal. The mask loss is used to train the network to generate a mask for each region proposal. The original network was trained using the COCO dataset~\cite{COCO}.
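With the torchvision implementation of Mask R-CNN, these terms (plus the RPN objectness and box regression losses) are returned directly in training mode, so a fine-tuning step on a sphere dataset could look roughly as follows (a sketch, with the data loading left out):
\begin{verbatim}
# Sketch: one training step of torchvision's Mask R-CNN for sphere detection.
import torch
import torchvision

# Two classes: background and sphere.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

def train_step(images, targets):
    # images: list of CHW float tensors; targets: list of dicts with
    # "boxes" (N x 4), "labels" (N,) and "masks" (N x H x W).
    loss_dict = model(images, targets)
    # loss_dict contains loss_classifier, loss_box_reg, loss_mask,
    # loss_objectness and loss_rpn_box_reg.
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_dict
\end{verbatim}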
While the authors of~\cite{spheredetect} obtain good results from this network on matte spheres, its performance drops when shiny spheres are introduced. This could be explained by the fact that convolutional neural networks tend to extract local features from images. Indeed, a chrome sphere can only really be identified by observing both the ``interior'' and ``exterior'' of the sphere, delimited by a ``distortion'' effect.
@@ -151,34 +160,53 @@ While the authors of the paper~\cite{spheredetect} obtain good results from this
To detect spheres in images, it is sufficient to estimate the center and radius of their projected circles. However, due to the perspective nature of photographs, the circles are often distorted and appear as ellipses.
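Concretely, an axis-aligned bounding box is described by four parameters, whereas a general ellipse requires five: its center $(c_x, c_y)$, its semi-axes $a \geq b$ and its orientation $\theta$, i.e. the set of points $(x, y)$ such that
\begin{equation}
    \frac{\big( (x - c_x)\cos\theta + (y - c_y)\sin\theta \big)^2}{a^2}
  + \frac{\big( -(x - c_x)\sin\theta + (y - c_y)\cos\theta \big)^2}{b^2} = 1,
\end{equation}
so an ellipse detector has to regress one extra parameter, the orientation, compared to a standard box detector.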
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\linewidth]{EllipseRCNN.png}
\includegraphics[height=0.3\linewidth]{EllipseRCNN.png}
\caption{The Ellipse R-CNN~\cite{dong_ellipse_2021} architecture.}
\label{fig:ellipsercnn}
\end{figure}
The Ellipse R-CNN~\cite{dong_ellipse_2021} is a modified version of Mask R-CNN~\cite{MaskRCNN} that can detect ellipses in images. It addresses this issue with an additional branch in the network that predicts the axes and orientation of each ellipse, allowing more accurate detection of elliptical objects, in our case spheres. It also handles occlusion: by predicting a segmentation mask for each ellipse, it can deal with overlapping and occluded objects. This makes it a good candidate for detecting spheres in real-world images with complex backgrounds and variable lighting conditions.
\subsection{GPN}
Gaussian Proposal Networks (GPNs) are an extension of Region Proposal Networks (RPNs) for detecting lesion bounding ellipses. The main goal of the original paper~\cite{li_detecting_2019} was to improve lesion detection systems commonly used on computed tomography (CT) scans, as lesions are often elliptical objects. RPNs are widely used in lesion detection, but they only propose bounding boxes, without fully leveraging the elliptical geometry of lesions.
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\linewidth]{GPN.png}
\includegraphics[height=0.4\linewidth]{GPN.png}
\caption{The GPN~\cite{li_detecting_2019} architecture.}
\label{fig:gpn}
\end{figure}
\subsection{DETR}
GPNs represent bounding ellipses as 2D Gaussian distributions on the image plane and minimize the Kullback-Leibler (KL) divergence between the proposed Gaussian and the ground truth Gaussian for object localization. The KL divergence loss is used as an approximation of the regression loss in the RPN framework when the rotation angle is 0.
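For completeness, if the ellipse is encoded as a Gaussian $\mathcal{N}(\mu, \Sigma)$, with the mean at the ellipse center and the covariance built from the semi-axes and orientation, the KL divergence between a proposed Gaussian $\mathcal{N}(\mu_p, \Sigma_p)$ and a ground-truth Gaussian $\mathcal{N}(\mu_g, \Sigma_g)$ in two dimensions is
\begin{equation}
    D_{\mathrm{KL}} = \frac{1}{2} \left( \mathrm{tr}\!\left( \Sigma_g^{-1} \Sigma_p \right)
    + (\mu_g - \mu_p)^\top \Sigma_g^{-1} (\mu_g - \mu_p)
    - 2 + \ln \frac{\det \Sigma_g}{\det \Sigma_p} \right),
\end{equation}
up to the choice of which distribution is taken as the reference in~\cite{li_detecting_2019}.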
GPNs could be an alternative to Ellipse R-CNN for detecting ellipses in images, but their architecture is more complex, which could make them trickier to implement and deploy to production.
\subsection{DETR \& DINO}
DETR (DEtection TRansformer)~\cite{carion_end--end_2020} is a new method that views object detection as a direct set prediction problem. The main goal of DETR is to streamline the detection pipeline by removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode prior knowledge about the task.
DETR uses a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture, as seen in Figure~\ref{fig:detr}. Given a fixed small set of learned object queries, the model reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. This makes the model conceptually simple and does not require a specialized library, unlike many other modern detectors.
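The bipartite matching itself can be computed with the Hungarian algorithm. A small sketch of how predicted queries and ground-truth spheres might be paired (with a simplified cost; the full DETR cost also includes a generalized IoU term) is:
\begin{verbatim}
# Sketch: Hungarian matching between N predicted queries and M ground-truth
# objects, with a simplified matching cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_probs, gt_boxes, gt_labels):
    # pred_boxes: (N, 4), pred_probs: (N, C), gt_boxes: (M, 4), gt_labels: (M,)
    cost_class = -pred_probs[:, gt_labels]                 # (N, M)
    cost_box = np.abs(pred_boxes[:, None, :]
                      - gt_boxes[None, :, :]).sum(-1)      # (N, M) L1 distance
    cost = cost_class + 5.0 * cost_box
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))  # each ground truth matched to one query
\end{verbatim}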
DETR demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can easily be generalized to produce panoptic segmentation in a unified manner, where it significantly outperforms competitive baselines. The training code and pretrained models are available on the project's website.
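For instance, a COCO-pretrained DETR checkpoint can be loaded through the HuggingFace transformers library and evaluated or fine-tuned on our images (a sketch; the checkpoint below is the generic COCO model, not one trained for spheres):
\begin{verbatim}
# Sketch: run a COCO-pretrained DETR on a single photograph.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("photos/scene_with_spheres.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold, as (xmin, ymin, xmax, ymax).
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7)[0]
for score, label, box in zip(detections["scores"],
                             detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], score.item(), box.tolist())
\end{verbatim}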
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\linewidth]{DETR.png}
\includegraphics[height=0.2\linewidth]{DETR.png}
\caption{The DETR~\cite{carion_end--end_2020} architecture.}
\label{fig:detr}
\end{figure}
+ \cite{zhang_dino_2022}
DINO (DETR with Improved deNoising anchOr boxes)~\cite{zhang_dino_2022} is a state-of-the-art object detector that improves on the performance and efficiency of previous DETR-like models. It utilizes a contrastive denoising training method, mixed query selection for anchor initialization, and a look-forward twice scheme for box prediction. DINO achieves a significant improvement in performance compared to the previous best DETR-like model DN-DETR. Additionally, it scales well both in terms of model size and data size compared to other models on the leaderboard.
\begin{figure}[ht]
\centering
\includegraphics[height=0.3\linewidth]{DINO.pdf}
\caption{The DINO~\cite{zhang_dino_2022} architecture.}
\label{fig:dino}
\end{figure}
\subsection{Mask2Former}


@@ -46,3 +46,12 @@
booktitle = {Proceedings of QCAV},
year = {2023}
}
@inproceedings{alicevision2021,
title = {{A}liceVision {M}eshroom: An open-source {3D} reconstruction pipeline},
author = {Carsten Griwodz and Simone Gasparini and Lilian Calvet and Pierre Gurdjos and Fabien Castan and Benoit Maujean and Gregoire De Lillo and Yann Lanthony},
booktitle = {Proceedings of the 12th ACM Multimedia Systems Conference - {MMSys '21}},
doi = {10.1145/3458305.3478443},
publisher = {ACM Press},
year = {2021}
}