diff --git a/assets/antoine_laurent/cheveaux.jpg b/assets/antoine_laurent/cheveaux.jpg index 4ae3003..979bfbd 100644 Binary files a/assets/antoine_laurent/cheveaux.jpg and b/assets/antoine_laurent/cheveaux.jpg differ diff --git a/assets/antoine_laurent/cheveaux.jpg.old b/assets/antoine_laurent/cheveaux.jpg.old new file mode 100644 index 0000000..4ae3003 Binary files /dev/null and b/assets/antoine_laurent/cheveaux.jpg.old differ diff --git a/assets/antoine_laurent/mammouths.jpg b/assets/antoine_laurent/mammouths.jpg index ec303d6..f850401 100644 Binary files a/assets/antoine_laurent/mammouths.jpg and b/assets/antoine_laurent/mammouths.jpg differ diff --git a/assets/antoine_laurent/mammouths.jpg.old b/assets/antoine_laurent/mammouths.jpg.old new file mode 100644 index 0000000..ec303d6 Binary files /dev/null and b/assets/antoine_laurent/mammouths.jpg.old differ diff --git a/assets/previous_work/matte.jpg b/assets/previous_work/matte.jpg index aeee1c6..13d35fd 100644 Binary files a/assets/previous_work/matte.jpg and b/assets/previous_work/matte.jpg differ diff --git a/assets/previous_work/matte.jpg.old b/assets/previous_work/matte.jpg.old new file mode 100644 index 0000000..aeee1c6 Binary files /dev/null and b/assets/previous_work/matte.jpg.old differ diff --git a/assets/previous_work/shiny.jpg b/assets/previous_work/shiny.jpg index 68647fa..d381abf 100644 Binary files a/assets/previous_work/shiny.jpg and b/assets/previous_work/shiny.jpg differ diff --git a/assets/previous_work/shiny.jpg.old b/assets/previous_work/shiny.jpg.old new file mode 100644 index 0000000..68647fa Binary files /dev/null and b/assets/previous_work/shiny.jpg.old differ diff --git a/assets/previous_work/shiny_inference.jpg b/assets/previous_work/shiny_inference.jpg new file mode 100644 index 0000000..20c8021 Binary files /dev/null and b/assets/previous_work/shiny_inference.jpg differ diff --git a/src/paper.pdf b/src/paper.pdf index 1ba683a..28be433 100644 Binary files a/src/paper.pdf and b/src/paper.pdf differ diff --git a/src/paper.tex b/src/paper.tex index 3df5faa..806f90f 100644 --- a/src/paper.tex +++ b/src/paper.tex @@ -23,6 +23,8 @@ breaklinks = true, } +\setlength{\parindent}{0cm} + \graphicspath{{../assets/}} \usepackage{lastpage} @@ -37,9 +39,12 @@ \vspace{5cm} \textbf{Bibliographie de projet long} } -\author{Laurent Fainsin} +\author{ + Laurent Fainsin \\ + {\tt\small laurent@fainsin.bzh} +} \date{ - \vspace{10cm} + \vspace{10.5cm} Département Sciences du Numérique \\ Troisième année \\ 2022 — 2023 @@ -65,7 +70,7 @@ \section{Introduction} -The field of 3D reconstruction techniques in photography, such as Reflectance Transformation Imaging (RTI)~\cite{giachetti2018} and Photometric Stereo~\cite{durou2020}, often require a precise understanding of the lighting conditions in the scene being captured. One common method for calibrating the lighting is to include one or more spheres in the scene, as shown in the left example of Figure~\ref{fig:intro}. However, manually outlining these spheres can be tedious and time-consuming, especially in the field of visual effects where the presence of chrome spheres is prevalent~\cite{jahirul_grey_2021}. This task can be made more efficient by using deep learning methods for detection. The goal of this project is to develop a neural network that can accurately detect both matte and shiny spheres in a scene, that could then be implemented in standard pipelines such as AliceVision Meshroom~\cite{alicevision2021}. 
+The field of 3D reconstruction techniques in photography, such as Reflectance Transformation Imaging (RTI)~\cite{giachetti2018} and Photometric Stereo~\cite{durou2020}, often requires a precise understanding of the lighting conditions in the scene being captured. One common method for calibrating the lighting is to include one or more spheres in the scene, as shown in the left example of Figure~\ref{fig:intro}. However, manually outlining these spheres can be tedious and time-consuming, especially in the field of visual effects where the presence of chrome spheres is prevalent~\cite{jahirul_grey_2021}. This task can be made more efficient by using deep learning methods for detection. The goal of this project is to develop a neural network that can accurately detect both matte and shiny spheres in a scene, which could then be integrated into standard pipelines such as AliceVision Meshroom~\cite{alicevision2021}. \begin{figure}[ht] \centering @@ -85,16 +90,13 @@ Previous work by Laurent Fainsin et al. in~\cite{spheredetect} attempted to addr \centering \begin{tabular}{cc} \includegraphics[height=0.3\linewidth]{previous_work/matte_inference.png} & - \includegraphics[height=0.3\linewidth]{previous_work/shiny_inference.png} + \includegraphics[height=0.3\linewidth]{previous_work/shiny_inference.jpg} \end{tabular} \caption{Mask R-CNN~\cite{MaskRCNN} inferences from~\cite{spheredetect} on Figure~\ref{fig:intro}.} \label{fig:previouswork} \end{figure} -The automatic detection (or segmentation) of spheres in scenes is a rather niche task and as a result there exists no known direct method to solve this problem. - -Parler des trucs qui n'ont rien à voir ici mais qui donne de l'espoir, -truc de jade et truc de PE. +In the field of deep learning, the specialized task of automatically detecting or segmenting spheres in scenes lacks a direct solution. Despite this, findings from studies in unrelated areas~\cite{dror_recognition_2003,qiu_describing_2021} indicate that deep neural networks may possess the capability to perform this task, offering hope for a performant solution. \section{Datasets} @@ -126,7 +128,7 @@ Antoine Laurent, a PhD candidate at INP of Toulouse, is working on the field of \label{fig:antoine_laurent_dataset} \end{figure} -He has compiled a dataset consisting of 400+ photographs, all of which contain 3D spherical markers, which are used to calibrate the lighting conditions and aid in the 3D reconstruction of these historical sites. +He has compiled a dataset consisting of 400+ photographs, all of which contain 3D spherical markers used to calibrate the lighting conditions and aid in the 3D reconstruction of these historical sites. These images, examples of which can be seen in Figure~\ref{fig:antoine_laurent_dataset}, will serve as the basis for our dataset. \newpage @@ -193,7 +195,7 @@ The network is composed of two parts: a backbone network and a region proposal n The network is trained using a loss function that is composed of three terms: the classification loss, the bounding box regression loss, and the mask loss. The classification loss is used to train the network to classify each region proposal as either a sphere or not a sphere. The bounding box regression loss is used to train the network to regress the bounding box of each region proposal. The mask loss is used to train the network to generate a mask for each region proposal. The original network was trained using the COCO dataset~\cite{COCO}.
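+As an illustration of how such a model is typically used, the minimal sketch below loads a COCO-pretrained Mask R-CNN from torchvision and runs it on a single image. The file name is only a placeholder, it assumes a recent torchvision release, and fine-tuning on a sphere dataset would still be required to reproduce the results of~\cite{spheredetect}.
+
+\begin{verbatim}
+import torch
+import torchvision
+from PIL import Image
+from torchvision.transforms.functional import to_tensor
+
+# COCO-pretrained Mask R-CNN with a ResNet-50 FPN backbone.
+model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
+model.eval()
+
+# Placeholder image path.
+image = to_tensor(Image.open("scene_with_spheres.jpg").convert("RGB"))
+
+with torch.no_grad():
+    # The model takes a list of CHW tensors and returns, for each image,
+    # a dict with "boxes", "labels", "scores" and per-instance "masks".
+    prediction = model([image])[0]
+
+keep = prediction["scores"] > 0.5
+print(prediction["boxes"][keep], prediction["labels"][keep])
+\end{verbatim}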
-The authors of the paper~\cite{spheredetect} achieved favorable results using the network on matte spheres, however, its performance declined when shiny spheres were introduced. This can be attributed to the fact that convolutional neural networks typically extract local features from images. Observing both the interior and exterior of a chrome sphere, as defined by a "distortion" effect, is necessary to accurately identify it. +The authors of the paper~\cite{spheredetect} achieved favorable results using the network on matte spheres; however, its performance declined when shiny spheres were introduced. This can be attributed to the fact that convolutional neural networks typically extract local features from images. Observing non-local features, such as the interior and exterior of a chrome sphere and the ``distortion'' effect that links them, may be necessary to accurately identify it. \subsection{Ellipse R-CNN} @@ -206,7 +208,7 @@ To detect spheres in images, it is sufficient to estimate the center and radius \label{fig:ellipsercnn} \end{figure} -The Ellipse R-CNN~\cite{dong_ellipse_2021} is a modified version of the Mask R-CNN~\cite{MaskRCNN} which can detect ellipses in images, it addresses this issue by using an additional branch in the network to predict the axes of the ellipse and its orientation, which allows for more accurate detection of objects and in our case spheres. It also have a feature of handling occlusion, by predicting the segmentation mask for each ellipse, it can handle overlapping and occluded objects. This makes it an ideal choice for detecting spheres in real-world images with complex backgrounds and variable lighting conditions. +The Ellipse R-CNN~\cite{dong_ellipse_2021} is a modified version of Mask R-CNN~\cite{MaskRCNN} which can detect ellipses in images. It addresses this issue by using an additional branch in the network to predict the axes of the ellipse and its orientation, which allows for more accurate localization of objects. It also handles occlusion: by predicting a segmentation mask for each ellipse, it can deal with overlapping and partially occluded objects. This makes it an ideal choice for detecting spheres in real-world images with complex backgrounds and variable lighting conditions. \subsection{GPN} @@ -219,9 +221,8 @@ Gaussian Proposal Networks (GPNs) is a novel extension to Region Proposal Networ \label{fig:gpn} \end{figure} -GPNs represent bounding ellipses as 2D Gaussian distributions on the image plane and minimize the Kullback-Leibler (KL) divergence between the proposed Gaussian and the ground truth Gaussian for object localization. The KL divergence loss is used as an approximation of the regression loss in the RPN framework when the rotation angle is 0. - -GPNs could be an alternative to Ellipse-RCNN for detecting ellipses in images, but it's architecture is more complex, it could be tricky to implement and deploy to production. +GPNs represent bounding ellipses as 2D Gaussian distributions on the image plane and minimize the Kullback-Leibler (KL) divergence between the proposed Gaussian and the ground truth Gaussian for object localization. +GPNs could be an alternative to Ellipse R-CNN~\cite{dong_ellipse_2021} for detecting ellipses in images, but their architecture is more complex and could be trickier to implement and deploy to production.
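+To make this representation concrete, the divergence minimized by such an approach can be written in closed form. The expression below is the generic KL divergence between two 2D Gaussians, given here only for illustration; the exact parametrization used by GPNs may differ.
+
+\[
+D_{\mathrm{KL}}\left(\mathcal{N}(\mu_p, \Sigma_p) \,\|\, \mathcal{N}(\mu_g, \Sigma_g)\right)
+= \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_g^{-1}\Sigma_p\right)
++ (\mu_g - \mu_p)^{\top}\Sigma_g^{-1}(\mu_g - \mu_p)
+- 2 + \ln\frac{\det\Sigma_g}{\det\Sigma_p}\right]
+\]
+
+where, under one common convention, $\mu$ is the ellipse center and $\Sigma = R(\theta)\,\mathrm{diag}(a^2, b^2)\,R(\theta)^{\top}$ encodes its semi-axes $a$, $b$ and orientation $\theta$; the subscripts $p$ and $g$ denote the proposed and ground truth ellipses.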
\subsection{DETR \& DINO} @@ -229,7 +230,7 @@ DETR (DEtection TRansformer)~\cite{carion_end--end_2020} is a new method propose DETR uses a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture, as seen in Figure~\ref{fig:detr}. Given a fixed small set of learned object queries, the model reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. This makes the model conceptually simple and does not require a specialized library, unlike many other modern detectors. -DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner and it significantly outperforms competitive baselines. +DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN~\cite{ren_faster_2016} baseline on the challenging COCO~\cite{COCO} object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner and it significantly outperforms competitive baselines. \begin{figure}[ht] \centering @@ -238,7 +239,7 @@ DETR demonstrates accuracy and run-time performance on par with the well-establi \label{fig:detr} \end{figure} -DINO (DETR with Improved deNoising anchOr boxes)~\cite{zhang_dino_2022} is a state-of-the-art object detector that improves on the performance and efficiency of previous DETR-like models. It utilizes a contrastive denoising training method, mixed query selection for anchor initialization, and a look-forward twice scheme for box prediction. DINO achieves a significant improvement in performance compared to the previous best DETR-like model DN-DETR. Additionally, it scales well both in terms of model size and data size compared to other models on the leaderboard. +DINO (DETR with Improved deNoising anchOr boxes)~\cite{zhang_dino_2022} is a state-of-the-art object detector that improves on the performance and efficiency of previous DETR-like models. It utilizes a contrastive denoising training method, mixed query selection for anchor initialization, and a look-forward twice scheme for box prediction. DINO achieves a significant improvement in performance compared to the previous best DETR-like model DN-DETR~\cite{li_dn-detr_2022}. Additionally, it scales well both in terms of model size and data size compared to other models on the leaderboard. \begin{figure}[ht] \centering @@ -249,7 +250,7 @@ DINO (DETR with Improved deNoising anchOr boxes)~\cite{zhang_dino_2022} is a sta \subsection{Mask2Former} -Mask2Former~\cite{cheng_masked-attention_2022} is a recent development in object detection and instance segmentation tasks. It leverages the strengths of two popular models in this field: Transformer-based architectures, such as DETR, and fully convolutional networks (FCN), like Mask R-CNN. +Mask2Former~\cite{cheng_masked-attention_2022} is a recent development in object detection and instance segmentation tasks. It leverages the strengths of two popular models in this field: Transformer-based architectures, such as DETR~\cite{carion_end--end_2020}, and fully convolutional networks (FCN), like Mask R-CNN~\cite{MaskRCNN}. 
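+Since both DETR and Mask2Former are available through the HuggingFace Transformers library we plan to use, the minimal sketch below shows how a pretrained DETR checkpoint could be loaded and queried. The checkpoint name, image path and score threshold are illustrative placeholders; a model fine-tuned on our sphere dataset would replace them.
+
+\begin{verbatim}
+import torch
+from PIL import Image
+from transformers import DetrImageProcessor, DetrForObjectDetection
+
+# COCO-pretrained DETR with a ResNet-50 backbone (illustrative checkpoint).
+processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
+model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+model.eval()
+
+image = Image.open("scene_with_spheres.jpg").convert("RGB")
+inputs = processor(images=image, return_tensors="pt")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Convert the fixed set of query predictions back to image coordinates.
+target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
+results = processor.post_process_object_detection(
+    outputs, target_sizes=target_sizes, threshold=0.7)[0]
+print(results["boxes"], results["labels"], results["scores"])
+\end{verbatim}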
\begin{figure}[ht] \centering @@ -264,28 +265,41 @@ Compared to Mask R-CNN, Mask2Former has a simpler architecture, with fewer compo \section{Training} -For the training process, we plan to utilize PyTorch Lightning~\cite{Falcon_PyTorch_Lightning_2019}, a high-level library for PyTorch~\cite{NEURIPS2019_9015}, and the HuggingFace Transformers~\cite{wolf-etal-2020-transformers} library for our transformer model. The optimizer we plan to use is AdamW~\cite{loshchilov_decoupled_2019}, a variation of the Adam~\cite{kingma_adam_2017} optimizer that is well-suited for training deep learning models. We aim to ensure reproducibility by using Nix for our setup and we will use Poetry for managing Python dependencies. This combination of tools is expected to streamline the training process and ensure reliable results. +For the training process, we plan to utilize PyTorch Lightning~\cite{Falcon_PyTorch_Lightning_2019}, a high-level library for PyTorch~\cite{NEURIPS2019_9015}, and the HuggingFace Transformers~\cite{wolf-etal-2020-transformers} library for our transformer model. The optimizer we plan to use is AdamW~\cite{loshchilov_decoupled_2019}, a variation of the Adam~\cite{kingma_adam_2017} optimizer that is well-suited for training deep learning models. We aim to ensure reproducibility by using Nix~\cite{nix} for our environment setup and Poetry~\cite{poetry} for managing Python dependencies. This combination of tools is expected to streamline the training process and ensure reliable results. \subsection{Loss functions} +In computer vision models such as Faster R-CNN~\cite{ren_faster_2016} and Mask R-CNN~\cite{MaskRCNN}, loss functions play a crucial role in the training process. They define the objective that the model aims to minimize during training, and minimizing this objective drives the model towards the desired performance. + +Faster R-CNN uses two loss functions: a classification loss and a regression loss. The classification loss measures the difference between the predicted object class and the ground truth class, and is usually calculated using the cross-entropy loss function. The regression loss measures the difference between the predicted bounding box and the ground truth bounding box, and is usually calculated using the smooth L1 loss function, a differentiable approximation of the L1 loss. + +Mask R-CNN, on the other hand, adds a segmentation loss to the losses used in Faster R-CNN. The segmentation loss measures the difference between the predicted segmentation mask and the ground truth mask. It is usually calculated using the binary cross-entropy loss function, applied independently to each pixel of the predicted mask. + +DETR uses a bipartite matching loss for training. The predicted set of detections is first matched one-to-one with the ground truth objects using the Hungarian algorithm, with a matching cost that combines the classification score and bounding box distances (an L1 term and a generalized IoU term). The loss is then computed over the matched pairs only. This set-based formulation handles the permutation invariance of the predictions, which is important for detecting an arbitrary number of objects in any order.
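+To make these ingredients concrete, the short sketch below computes a smooth L1 regression term and a toy DETR-style bipartite matching with a negative-IoU cost. It is a simplified, self-contained illustration rather than the exact loss of any of the cited models; DETR, for instance, also includes classification and generalized IoU terms in its matching cost.
+
+\begin{verbatim}
+import torch
+import torch.nn.functional as F
+from scipy.optimize import linear_sum_assignment
+from torchvision.ops import box_iou
+
+# Toy predicted and ground truth boxes in (x1, y1, x2, y2) format.
+pred_boxes = torch.tensor([[10., 10., 50., 50.], [60., 60., 90., 90.]])
+gt_boxes = torch.tensor([[12., 8., 48., 52.]])
+
+# Smooth L1 regression term, as used by Faster/Mask R-CNN on matched proposals.
+reg_loss = F.smooth_l1_loss(pred_boxes[:1], gt_boxes)
+
+# DETR-style set matching: build a cost matrix (here simply the negative IoU)
+# and solve the optimal one-to-one assignment with the Hungarian algorithm.
+cost = -box_iou(pred_boxes, gt_boxes)  # shape: (num_predictions, num_targets)
+pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
+
+# The final loss is only computed over the matched (prediction, target) pairs.
+matched_loss = F.smooth_l1_loss(pred_boxes[torch.as_tensor(pred_idx)],
+                                gt_boxes[torch.as_tensor(gt_idx)])
+print(reg_loss.item(), matched_loss.item(), list(zip(pred_idx, gt_idx)))
+\end{verbatim}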
+ \subsection{Metrics} -pytorch metrics~\cite{TorchMetrics_2022} +In object detection and instance segmentation tasks, metrics such as DICE, IoU, or mAP are commonly used to evaluate the performance of a computer vision model. -dice -IoU +Mean Average Precision (mAP) is a widely used metric in object detection. It represents the average of the Average Precision (AP) values over the object classes in a dataset. The AP is the area under the Precision-Recall curve, which plots the precision and recall of a model at different confidence thresholds. mAP therefore provides a comprehensive measure of the overall ability of a model to detect objects of different classes in a dataset. + +Intersection over Union (IoU), also known as the Jaccard index, is another widely used metric in object detection. It measures the similarity between the predicted bounding box and the ground truth bounding box by calculating the ratio of the area of their intersection to the area of their union. A high IoU value indicates that the predicted bounding box is well aligned with the ground truth. + +In instance segmentation, the Dice Coefficient (DICE) is a widely used metric. It measures the similarity between the predicted segmentation mask and the ground truth mask by calculating the ratio of twice the area of their intersection to the sum of their areas. A DICE value of 1 indicates a perfect match between the predicted and ground truth masks. + +These metrics are available in the TorchMetrics~\cite{TorchMetrics_2022} library (a minimal usage sketch is given after the conclusion) and provide valuable insights into the performance of object detection and instance segmentation models, enabling the identification of areas for improvement and guiding further development. \subsection{Experiment tracking} -To keep track of our experiments and their results, we will utilize Weights \& Biases (W\&B)~\cite{wandb} and Aim~\cite{Arakelyan_Aim_2020}. W\&B is a popular experiment tracking tool that provides a simple interface for logging and visualizing metrics, models, and artifacts. Aim is a collaborative machine learning platform that provides a unified way to track, compare, and explain experiments across teams and tools. By utilizing these tools, we aim to efficiently track our experiments and compare results. This will allow us to make data-driven decisions and achieve better results if we have enough time. +To keep track of our experiments and their results, we will utilize Weights \& Biases (W\&B)~\cite{wandb} or Aim~\cite{Arakelyan_Aim_2020}. W\&B is a popular experiment tracking tool that provides a simple interface for logging and visualizing metrics, models, and artifacts. Aim is a collaborative machine learning platform that provides a unified way to track, compare, and explain experiments across teams and tools. By utilizing one of these tools, we aim to efficiently track our experiments and compare results, allowing us to make data-driven decisions and, time permitting, achieve better results. \section{Deployment} -For deployment, we plan to use the ONNX~\cite{ONNX} format. This format provides a standard for interoperability between different AI frameworks and helps ensure compatibility with a wide range of deployment scenarios. To ensure the deployment process is seamless, we will carefully choose an architecture that is exportable, though most popular architectures are compatible with ONNX. Our model will be run in production using ONNXRuntime~\cite{ONNX_Runtime_2018}, a framework that allows for efficient inference using ONNX models.
This combination of tools and formats will ensure that our model can be deployed quickly and easily in a variety of production environments such as AliceVision Meshroom. +For deployment, we plan to use the ONNX~\cite{ONNX} format. This format provides a standard for interoperability between different AI frameworks and helps ensure compatibility with a wide range of deployment scenarios. To ensure the deployment process is seamless, we will carefully choose an architecture that is exportable, though most popular architectures are compatible with ONNX. Our model will be run in production using ONNXRuntime~\cite{ONNX_Runtime_2018}, a framework that allows for efficient inference using ONNX models (a minimal export sketch is given after the conclusion). This combination of tools and formats will ensure that our model can be deployed quickly and easily in a variety of production environments such as AliceVision Meshroom~\cite{alicevision2021}. \section{Conclusion} -In conclusion, the detection of matte spheres has been explored and is possible, however, the automatic detection of chrome spheres has not been fully investigated. The initial step towards this goal would be to evaluate the capabilities of transformer-based architectures, such as DETR, in detecting chrome spheres. If successful, further improvements can include the prediction of bounding ellipses instead of just bounding boxes, exporting the model to the ONNX format, and implementation inside the Alicevision Meshroom software. +In conclusion, the detection of matte spheres has been explored and shown to be feasible; however, the automatic detection of chrome spheres has not been fully investigated. The initial step towards this goal would be to evaluate the capabilities of transformer-based architectures, such as DETR, in detecting chrome spheres. If successful, further improvements can include the prediction of bounding ellipses instead of just bounding boxes (modifications to the architecture already allow the detection of angled bounding boxes~\cite{dai_ao2-detr_2022}), exporting the model to the ONNX format, and deploying it inside the AliceVision Meshroom software.
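+As a complement to the Metrics subsection, the following minimal sketch evaluates toy predictions with TorchMetrics~\cite{TorchMetrics_2022}. The tensors are placeholders rather than real model outputs, and it assumes a recent TorchMetrics release (the mAP metric may additionally require the pycocotools package, depending on the version).
+
+\begin{verbatim}
+import torch
+from torchmetrics import JaccardIndex
+from torchmetrics.detection import MeanAveragePrecision
+
+# Toy box predictions and targets (placeholders, not real model outputs).
+preds = [{"boxes": torch.tensor([[10., 10., 50., 50.]]),
+          "scores": torch.tensor([0.9]),
+          "labels": torch.tensor([1])}]
+target = [{"boxes": torch.tensor([[12., 8., 48., 52.]]),
+           "labels": torch.tensor([1])}]
+
+map_metric = MeanAveragePrecision()
+map_metric.update(preds, target)
+print(map_metric.compute()["map"])
+
+# Binary masks for a single instance: IoU (Jaccard index) and Dice coefficient.
+pred_mask = torch.zeros(64, 64, dtype=torch.int)
+pred_mask[10:40, 10:40] = 1
+gt_mask = torch.zeros(64, 64, dtype=torch.int)
+gt_mask[15:45, 15:45] = 1
+
+iou = JaccardIndex(task="binary")(pred_mask, gt_mask)
+intersection = (pred_mask & gt_mask).sum()
+dice = 2 * intersection / (pred_mask.sum() + gt_mask.sum())
+print(iou.item(), dice.item())
+\end{verbatim}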
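+For the deployment path described above, the sketch below exports a placeholder network to ONNX and runs it with ONNXRuntime, assuming a recent PyTorch version. A ResNet-50 backbone stands in for our final detector, whose exportability will still have to be verified as discussed in the Deployment section.
+
+\begin{verbatim}
+import torch
+import torchvision
+import onnxruntime as ort
+
+# Placeholder network: a ResNet-50 stands in for the final detector.
+model = torchvision.models.resnet50(weights="DEFAULT")
+model.eval()
+
+dummy = torch.randn(1, 3, 224, 224)
+torch.onnx.export(
+    model, dummy, "model.onnx",
+    input_names=["image"], output_names=["logits"],
+    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
+    opset_version=17,
+)
+
+# Inference on the exported graph with ONNXRuntime.
+session = ort.InferenceSession("model.onnx")
+(logits,) = session.run(None, {"image": dummy.numpy()})
+print(logits.shape)  # (1, 1000) ImageNet logits for the placeholder model
+\end{verbatim}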
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% diff --git a/src/softs.bib b/src/softs.bib index 2ed5655..081c278 100644 --- a/src/softs.bib +++ b/src/softs.bib @@ -135,10 +135,26 @@ url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf} } -@sofwate{polyhaven, +@software{polyhaven, title = {{Poly Haven}: 3D models for everyone}, url = {https://polyhaven.com/}, license = {CC-BY-NC-4.0}, author = {{Poly Haven team}}, year = {2021} } + +@software{nix, + title = {Nix: The purely functional package manager}, + url = {https://nixos.org/}, + author = {Eelco Dolstra and NixOS Foundation}, + license = {MIT}, + year = {2013-2023} +} + +@software{poetry, + title = {Poetry: Python dependency management and packaging made easy}, + url = {https://github.com/python-poetry/poetry/}, + author = {Sébastien Eustace}, + license = {MIT}, + year = {2018-2023} +} diff --git a/src/zotero.bib b/src/zotero.bib index 7ca09af..c89594d 100644 --- a/src/zotero.bib +++ b/src/zotero.bib @@ -12,11 +12,12 @@ file = {Snapshot:/home/laurent/Zotero/storage/DXQJISMX/detr-object-detection.html:text/html}, } -@article{dror_recognition_nodate, +@article{dror_recognition_2003, title = {Recognition of {Surface} {Reflectance} {Properties} from a {Single} {Image} under {Unknown} {Real}-{World} {Illumination}}, abstract = {This paper describes a machine vision system that classifies reflectance properties of surfaces such as metal, plastic, or paper, under unknown real-world illumination. We demonstrate performance of our algorithm for surfaces of arbitrary geometry. Reflectance estimation under arbitrary omnidirectional illumination proves highly underconstrained. Our reflectance estimation algorithm succeeds by learning relationships between surface reflectance and certain statistics computed from an observed image, which depend on statistical regularities in the spatial structure of real-world illumination. Although the algorithm assumes known geometry, its statistical nature makes it robust to inaccurate geometry estimates.}, language = {en}, author = {Dror, Ron O and Adelson, Edward H and Willsky, Alan S}, + year = {2003}, file = {Dror et al. - Recognition of Surface Reflectance Properties from .pdf:/home/laurent/Zotero/storage/HJXFDDT6/Dror et al. - Recognition of Surface Reflectance Properties from .pdf:application/pdf}, } @@ -431,3 +432,48 @@ Publisher: IEEE}, annote = {Comment: Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015}, file = {arXiv Fulltext PDF:/home/laurent/Zotero/storage/EQ38Q4BJ/Kingma and Ba - 2017 - Adam A Method for Stochastic Optimization.pdf:application/pdf;arXiv.org Snapshot:/home/laurent/Zotero/storage/JSNDPECJ/1412.html:text/html}, } + +@misc{ren_faster_2016, + title = {Faster {R}-{CNN}: {Towards} {Real}-{Time} {Object} {Detection} with {Region} {Proposal} {Networks}}, + shorttitle = {Faster {R}-{CNN}}, + url = {http://arxiv.org/abs/1506.01497}, + doi = {10.48550/arXiv.1506.01497}, + abstract = {State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.}, + urldate = {2023-02-06}, + publisher = {arXiv}, + author = {Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian}, + month = jan, + year = {2016}, + note = {arXiv:1506.01497 [cs]}, + keywords = {Computer Science - Computer Vision and Pattern Recognition}, + annote = {Comment: Extended tech report}, + file = {arXiv Fulltext PDF:/home/laurent/Zotero/storage/SHZFG4RW/Ren et al. - 2016 - Faster R-CNN Towards Real-Time Object Detection w.pdf:application/pdf;arXiv.org Snapshot:/home/laurent/Zotero/storage/F3VRI6F7/1506.html:text/html}, +} + +@misc{noauthor_end--end_2023, + title = {End-to-{End} {Detection} {Transformer} ({DETR})}, + url = {https://neuralception.com/objectdetection-detr/}, + abstract = {A brief explanation of how the detection transformer (DETR) and self-attention work.}, + language = {en}, + urldate = {2023-02-06}, + month = feb, + year = {2023}, + file = {Snapshot:/home/laurent/Zotero/storage/CQBYUSC4/objectdetection-detr.html:text/html}, +} + +@misc{li_dn-detr_2022, + title = {{DN}-{DETR}: {Accelerate} {DETR} {Training} by {Introducing} {Query} {DeNoising}}, + shorttitle = {{DN}-{DETR}}, + url = {http://arxiv.org/abs/2203.01305}, + doi = {10.48550/arXiv.2203.01305}, + abstract = {We present in this paper a novel denoising training method to speedup DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching which causes inconsistent optimization goals in early training stages. To address this issue, except for the Hungarian loss, our method additionally feeds ground-truth bounding boxes with noises into Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to a faster convergence. Our method is universal and can be easily plugged into any DETR-like methods by adding dozens of lines of code to achieve a remarkable improvement. As a result, our DN-DETR results in a remarkable improvement (\$+1.9\$AP) under the same setting and achieves the best result (AP \$43.4\$ and \$48.6\$ with \$12\$ and \$50\$ epochs of training respectively) among DETR-like methods with ResNet-\$50\$ backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with \$50{\textbackslash}\%\$ training epochs. 
Code is available at {\textbackslash}url\{https://github.com/FengLi-ust/DN-DETR\}.}, + urldate = {2023-02-06}, + publisher = {arXiv}, + author = {Li, Feng and Zhang, Hao and Liu, Shilong and Guo, Jian and Ni, Lionel M. and Zhang, Lei}, + month = dec, + year = {2022}, + note = {arXiv:2203.01305 [cs]}, + keywords = {Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition}, + annote = {Comment: Extended version from CVPR 2022}, + file = {arXiv Fulltext PDF:/home/laurent/Zotero/storage/N7NA2XDB/Li et al. - 2022 - DN-DETR Accelerate DETR Training by Introducing Q.pdf:application/pdf;arXiv.org Snapshot:/home/laurent/Zotero/storage/DM3P2FKW/2203.html:text/html}, +}