Learning Point Processes and Convolutional Neural Networks for Object Detection in Satellite Images

: Convolutional neural networks (CNN) have shown great results for object-detection tasks


Introduction
While object detection has been thoroughly studied for the last 20 years [1], the detection of small objects in optical satellite images remains a challenging task: With limited spatial resolution (around 0.5 m/pixel), objects such as cars can be only a few pixels wide.Moreover, this tiny-object scattering density varies greatly, and when closely packed, instances might be difficult to differentiate.However, as the geometrical configuration of one object is often dependent on its neighbors, we can leverage these priors to improve the detection results.
Methods based on convolutional neural networks (CNN) such as Faster R-CNN [2], YOLO [3], or RetinaNet [4] propose to detect objects in "natural" images.Some of these approaches first extract features through convolutions, then propose a series of boxes (anchors) that are refined by regression afterward [5,6].The specificities of remote-sensing data (high number of small objects) motivate the use of anchor-free methods, relying on heatmap (probability of object center) inference for oriented vehicle detection [7][8][9], or ship detection [10] (here approximating the heatmap through a vector field).
In remote sensing, the sensor is intrinsically quite distant from the observed objects, implying a limited spatial resolution (usually measured in meters per pixel).This limited resolution, in turn, induces a limited amount of visual information to perform object detection.Given the sensor noise or atmospheric perturbations, remote-sensing detection methods need to be innovative to compensate for the limited signal.Approaches such as [11,12] use the temporal information from image time series to improve the detection of small objects.However, these are not suitable for detecting small static objects.To extract object orientation, Ref. [13] focuses on extracting fine-grained features while ignoring any inter-instance interactions.The interaction between nearby objects is key to analyzing a scene for human operators.However, those are often ignored in CNN-based approaches (at least not beyond a post-processing step such as non-maximum suppression) since CNN models are-by construction-most efficient at extracting texture and local information [14].Meanwhile, long-range relationships are only caught when drastically increasing the depth of the network or by introducing dense connections.For instance, Ref. [15] models interactions through a cascade of Transformers at a great complexity cost, while [16] uses text-modal descriptors to introduce prior knowledge about the objects and their relations into the model.Meanwhile, Point Processes [17] have been used to perform object detection while relying on a stochastic geometry model to jointly solve the detection and selection of objects with priors.Point Process approaches model the probability on the whole set of points in the image, thus taking into account the interactions between objects.These approaches rely on decomposing the Point Process density into a data term that measures the correspondence of the points against the image and some prior terms that measure the coherence of the point configuration itself.These approaches have shown good results for detection tasks in images with many small objects, such as biological imagery [18] or remote sensing for road [19,20] and building [21] extraction, or tree detection [22].
Previous methods make use of data terms built upon image contrast measures between the interior and exterior of the object [18,20,23] or image gradient [24], which perform great in images with clear contrast between objects and their background.However, in the case of optical satellite images, background diversity, variable visual aspects of objects, and inconsistent illumination make these contrast measures unreliable.
In this paper, we propose the incorporation of interaction models into object-detection methods while taking advantage of the capabilities of deep convolutional neural networks.Satellite images are inherently imperfect due to atmospheric disturbances, partial occlusions, and limited spatial resolution.To compensate for this lack of visual information, it becomes essential to incorporate prior knowledge about the layout of objects of interest.
Instead of increasing the complexity of the model by adding, for example, Transformers to a deep CNN architecture, we propose in this paper to combine CNN pattern extraction with the Point Process approach.The starting point of this approach is to use the output of a CNN as the data term for a Point Process model (Section 2.3).From the latter, we derive more efficient sampling methods for the Point Process that do not rely on application-specific heuristics (Section 2.4).Then, we propose to bridge the gap in terms of parameter estimation using modern learning techniques inspired by Energy-Based Models (Section 2.5).Ultimately, we introduce a novel scoring function that takes into account object interaction in measuring the confidence in the detections and enables explainable results (Section 2.6).The final part presents quantitative and qualitative results on optical satellite data (Section 3).

Materials and Methods
In this paper, we propose an object-detection method leveraging CNN-extracted information within a Point Process framework that models object interactions.First, we introduce the foundations for the Point Process on which our model is built, then detail the energy model, along with its sampling and parameter estimation method.We also propose a novel scoring method to evaluate the confidence in the output while considering the interactions and providing explainable results.
To train the CNN and infer the model parameters θ, we compile DOTA 0.5 , a dataset of vehicles in remote-sensing images with 0.5 meters per pixel resolution, from the DOTA dataset [25].This dataset DOTA 0.5 is built by sub-sampling images from DOTA to the desired spatial resolution and keeping only small-vehicle and large-vehicle classes.The data consists of images from satellite or aerial sources of urban or countryside scenes with variable densities of objects (from isolated cars on country roads to dense parking lots).The images are labeled with sets of oriented rectangles for the objects of interest.We also compile a noisy version of DOTA 0.5 :DOTA 0.5 +noise, by adding a Gaussian noise on each image (σ = 0.3) to test the resilience of the methods against noise.Moreover, we evaluate models on data provided by ADS.These aerial data are sub-sampled to the desired resolution, emulating the above-mentioned satellite characteristics.This dataset is unlabeled; thus, it will only serve to evaluate the models qualitatively and was not used for training.

Point Process for Object Detection
Point Processes model configurations of points (a finite non-ordered set y = {y 1 , . . ., y n } of elements of S) as realizations of a random variable Φ in the set of all possible configurations Y = ∞ n=0 (S × M) n (with an arbitrary amount of points).Space S corresponds to the image space, and M to the mark space.The marks can be any random variable that describes the point beyond its location, from the radius of a circle to a discrete categorization of the object.Additionally, we denote n(y) the number of points in the configuration y and Y n the set of all n points configurations.
As we look to detect vehicles, we model points as rectangles described by the following marks: width y a , length y b , and angle y α as shown in Figure 1 (with the mark space defined as M = [a min , A Point Process Φ is described by its density h relative to the uniform Point Process [17].The model of selection and interaction of points derives from an energy U, through a Gibbs density: where X is the image, θ the parameters of the model, and Z is a normalizing constant.This constant is intractable due to the non-fixed dimension of Y.However, it can be shown that Z exists (The existence and finiteness of Z are sufficient as the following computations will consider ratios of h; thus Z will cancel out.) and thus h is properly defined as the energy per point is bounded (see Appendix A.2).In short, the energy function U measures the compatibility of the configuration y with the image X; the lower the energy, the higher the compatibility (see Energy-Based Models [26]).It follows that the most compatible output y * minimizes the energy U (i.e.maximizes the density h).

Markovian Point Process
For most applications [18,20,21,23], the physical constraints of the system that is being modeled imply that the marginal density of a point y ∈ y should only depend on a neighborhood around it (e.g.vehicles only align with others within a limited distance).This property is formalized through the Markovianity of the Point Process.Definition 1. Point Process Φ is a Markov Point Process under the symmetric and reflexive relation∼if and only if, for every configuration y ∈ Y such that h(y) > 0 [17]: For every point u ∈ S, h({u} ∪ y)/h(y) only depends on u and its neighbors according to ∼ (i.e., {y ∈ y, y ∼ u}).
Theorem 1.For a Point Process derived from the energy: where N y y = {y ∈ y, y ∼ y, y = y } is the set of neighbors of y for relation ∼.Then, the Point Process is Markovian for relation ∼ 2 with y ∼ 2 y ≡ ∃u ∈ S × M, y ∼ u ∼ y .
The proof is provided in Appendix A.1.
In short, if the energy V(y, N y y , X, θ) of a point y depends on its neighbors within a distance of d max , the Point Process is Markovian for a distance 2d max .The Markovianity will prove useful to simplify the sampling procedure and run it in parallel over the whole image.

Papangelou Conditional Intensity
The reference Poisson Point Process has an intensity λ that is either constant or depends on the location (For any compact A ⊆ S, E[n(Φ)] = λ|A| if λ is constant.).The density in (1) implies the intensity is now a function of the location and neighborhood of a point: Definition 2. The Papangelou conditional intensity λ(•; •) [17] associated with a Point Process Φ, is defined as: i.e., the infinitesimal probability of finding a point in region dy around y ∈ S, given the configuration y outside dy (i.e.(dy) c ).When y ∈ y, the Papangelou conditional intensity can be computed from the energy function U as: λ(y; y \ {y}) = h(y|X) h(y \ {y}|X) = exp(U(y \ {y}, X, θ) − U(y, X, θ)). (4)

Energy-Based Model
The energy formulation of the Point Process allows the compositing of several simple point behaviors into one model.We illustrate the composition of several simple point distributions through the addition of energy terms in Figure 2: There, we show that the point distribution in (4) can be obtained by simply adding the energies used to produce (2) and ( 3).When performing object detection, the image provides information about the location and properties of objects (e.g., location-based energy as in Figure 2(2)), while prior knowl-edge of the objects of interest complements this information (e.g., a repulsion term as in Figure 2(3)); the composability of the energy model allows the factoring in of both pieces of information into a single model (as in the composition of Figure 2(4)).
More specifically: For every point y ∈ y we compute a set of energy terms V e (y, X, N y y ), e ∈ ξ and combine those-per point-as a weighted sum [20,21,23]: y, X, N y y , θ (5) with V(y, X, N y y , θ) is the energy contribution of a single object y.Energy terms can be grouped into two categories:

•
Prior terms (of the form V e (y, N y y , θ)) that only depend on y and its neighborhood N y y ; they measure the coherence of the configuration itself considering the known properties of the objects of interest (e.g., objects do not overlap).

•
Data terms (V e (y, X, θ)) are a function of y and the image X (also referred to as observation in remote sensing); these terms measure the correspondence of the points with the image.
The weights w are part of the parameters θ that will need to be set before inference on any unseen image.We discuss the estimation of parameters θ in Section 2.5.
In the rest of this section, we define the multiple energy terms V e : First, the data terms , then the prior terms.Contrary to previous Point Process (PP) approaches, we do not rely on contrast measures but rather on data terms pulled from the output of a CNN: this allows for more reliable measures and faster computation.Moreover, this new model formulation opens up new improvements in sampling the energy model and estimating its parameters.

Data Terms from a CNN
Classical likelihood measurements [18,20,23] are based on contrast measures that are crafted to fit each application.Moreover, these measures rely heavily on the high contrast between objects and their background, along with limited background diversity.We illustrate the limitations of those in Section 3.1.
On the other hand, CNN models have shown to be very efficient at extracting features from images for object detection and classification.In the following section, we will show how to interpret a CNN-based object detector output to obtain a scalar energy that measures the fitness of a configuration against an image.It allows us to go past the contrast measure design by utilizing a pre-trained CNN output.

Position Likelihood Term
CNN-based object-detection methods such as [7,8,27] make use of a heatmap to find object centers.This object center probability map is obtained by passing the image X of shape H, W, 3 (here with 3 color channels) into a fully convolutional model such as Unet [28].It outputs a tensor Z pos ∈ R H×W , from which a probability-like measure of an object center at coordinates (y i , y j ) is obtained as p(y i , y j |X) = σ( Z pos [y i , y j ]) where σ(•) is the sigmoid function, and Z pos [y i , y j ] the value of map Z pos at coordinates (y i , y j ).
We propose to reinterpret this output as the position energy as follows: with t pos a scalar threshold, allowing the movement of the inflection point of the Softplus function x → ln(1 + exp(−x)).

Mark Likelihood Term
To compute the energy associated with marks, we discretize each mark κ (κ ∈ {a, b, α} in the case of a Point Process (PP) of rectangles), into n κ classes (or value bins) in the range [κ min , κ max ].We define the integer class of a value v for mark κ as: with c κ (v) being the corresponding integer class.
Now, supposing we have a model trained to classify the mark of an object at a given position (i, j) ∈ S, such a model will produce a probability estimate of mark κ to be in class c ∈ 1, n κ , from model output Z As in [29], we can reinterpret the model output into a potential, giving: As the CNN outputs a tensor Z κ of shape (H, W, n κ ), we obtain vector Z y i ,y j κ by sampling tensor Z κ at location ( y i , y j ), as illustrated in Figure 3a.Vector Z y i ,y j κ gives an estimate of the probability of each class c once passed through a Softmax (as given by ( 8)); We illustrate this with the histogram, on top of Z y i ,y j κ in Figure 3a.The value of V κ (y, X) is then obtained by applying the formula in (9) over vector Z y i ,y j κ (as illustrated on the right-hand side of Figure 3).

Interpolation
The energies described in ( 6) and ( 9) are defined over integer coordinates in S and integer classes in M.However, points y in the configuration y have real-valued locations (y i , y j ) ∈ R 2 as is for marks in M. Thus, we propose interpolating values between exact pixel locations.For ( 9) we now have : The above has two benefits: first, (10) defines the energy for continuous values of the marks using bilinear interpolation, thus ensuring Lipschitz continuity of the energy.Second, we can pre-compute (11) once for a given image, making the mark energy computation a simple value lookup and interpolation for each point.In practice, we proceed as illustrated in Figure 3b, where a function is applied to the CNN output Z κ to obtain V κ , which is stored in memory.The interpolation gives V κ for any continuous position and mark.This precomputation is quite fast, as it is defined as an operation over a tensor, which is implicitly parallelized by tensor computation libraries such as PyTorch [30].The same holds for the position term (6).

Energy Priors
The total energy model encompasses several priors as energies, which allows regularizing configuration against the data terms.

Object Priors
These are functions of the current point y only.For k = ratio, area: with f ratio (y) = y b /y a and f area (y) = y a y b , and µ k ,σ k being, respectively, the target value and the standard deviation.These two are illustrated in the first line of Figure 4.
Illustration of some priors.For priors acting on pairs of objects (zrNbr, align, ovrlp, repls), we show the energy for a pair of objects y and u (see (14) to (18) for the aggregation of multiple pair interactions into a single energy scalar).
For our objects of interest, we find two modes in the distribution of areas and ratios: Cars and trucks.To make sure to only favor these two modes and not objects with the area of a truck and ratio of a car, we introduce a joint area and ratio prior:

Interaction Priors
The following priors depend on the neighborhood of the point y.The term in ( 14) penalizes overlapping objects (with t overlap the overlap threshold), ( 15) favors aligned objects (t α = 0 favors parallel objects, while t α = π/2 prefers perpendicular objects), ( 16) and ( 17) are, respectively, repulsive and attractive priors.Finally, (18) allows the adjustment of the energy of neighborless points.Some of these priors are illustrated in Figure 4.
V align y, N V attrc y, N V zrNbr y, N where Relu(x) = max(0, x) for x ∈ R. Note we introduced a few parameters µ ratio , σ ratio , t ovrlp , . . ., these will have to be either set by trial and error or through the parameter estimation method presented in Section 2.5.

Model Pipeline
We summarize the implementation of the overall energy model in Figure 5.As is, our architecture has the following advantages:

•
We simplify the data term computation from complex contrast measures [18,20,23] into a simple sampling and interpolation of a tensor (highlighted in green in Figure 5).

•
The backbone CNN architecture remains very simple as it does not need to model any object interactions.

•
The CNN inference-which is the most complex operation within the graph in Figure 5 (FCN block)-is limited to once per image, as the energy U can be computed for any configuration y ∈ Y without having to infer V pos and V κ a second time (pre-computed maps highlighted in lilac in Figure 5).
• New interactions or priors can be easily added to the model as energies are aggregated through a simple linear combination (see right-hand part of the pipeline in Figure 5).

•
The implementation with tensor computation libraries (here PyTorch [30]) allows for implicit parallelization on GPU (or CPU) using the tensors batch dimension.
• The latter also enables automatic differentiation of energy U, both relative to configuration y and to parameters θ, which will be, respectively, useful in Sections 2.4.3 and 2.5.4.

•
Lastly, as the energy maps are defined over a raster space (thus easy to normalize), those will be of use to design the birth map in Section 2.4.1.

Ve(•)
• Ve(y) ⊕ ×w y∈y U (y, X, θ) Precomputed maps Figure 5. Energy model pipeline.The pre-computed tensors can be reused for computing different y ∈ Y for a given image X.

Sampling Configurations from the Energy Model
For a given energy model U (supposing parameters θ are set), we aim at sampling configurations that follow the Gibbs density defined in (1).This will allow us to find the most compatible configuration y * for any image X, i.e.: As the space Y in which we are trying to minimize the energy is not of fixed density, we must resort to an adapted sampling method such as Reversible Jump Monte Carlo Markov Chain (RJMCMC) [31].While this method ensures proper sampling of the desired law, it often requires some application-specific adaptations to improve sampling time and efficiency (e.g., [20,23]).
Here, we propose to use the already computed energy maps V pos and V κ to add new points in relevant areas and to focus on the parallel sampling of the algorithm.Moreover, the implementation of our energy model allows us to leverage automatic gradient computation to implement diffusion dynamics and explore at a fixed number of points (i.e., within Y n ) guided by the whole energy model.

Birth Maps for Efficient Point Proposals
At their core, both RJMCMC [31] (an adaptation of the Metropolis-Hastings algorithm [32] to variable dimension problems) and Jump Diffusion [33] make use of Birth and Death moves to build a Markov chain (y t ) t=1,... of stationary density h(•|X) 1/T t where T t is a temperature parameter that decreases towards zero (i.e.Simulated Annealing) (Convergence is proven for a logarithmic temperature decrease, here (and in the literature) we approximate it with a geometric decrease at it is faster [17]).That way, the chain converges towards the global minimum energy.Proposed Birth (addition of points to the current y t ) and Deaths (removal of points) are accepted or rejected given an acceptance ratio that depends on the energy change induced by the respective addition or removal of points (for more details see [31,34]).
The simplest birth move is to propose new points uniformly in S × M.However, this proves highly inefficient as the density of objects in the image varies a lot.Ideally, the birth move would propose a new point u to the current configuration y, sampled with the marginal density p(u|y) as is carried out within Gibbs sampling (maximizing the chance for the point addition to be accepted).Knowing that p(u|y) ∝ h(y ∪ {u})/h(y), we could compute the marginal density using energy change induced by adding the point.However, this cannot easily be computed because of the interaction terms.Thus, we propose to approximate it by only considering the data terms (i.e.bypassing the energy change on prior terms) : From this, we define the density d to sample a new point u: with V pos and V κ the pre-computed position and marks energy maps.Since both are defined as tensors with a finite number of elements, the normalizing constant Z d in ( 21) can be computed over this discretized space of pixels and mark classes.
This allows for efficient sampling of new points, without any application-specific heuristics, by taking into account a truncated version of the energy model U that can be normalized easily.

Focused Parallel Sampling
In its canonical implementation, the RJMCMC algorithm operates sequentially over the points in the image-It may only add/remove/transform one point at a time -, which increases simulation time linearly with the number of objects.With parallelization, we aim to take advantage of the spatial Markovianity of the Point Process, as Theorem 1 states that moves further than 2d max apart are independent.
As in [35], we split the image into square cells of size d c that are each assigned one set s (each set corresponding to one color in Figure 6).These sets are referred to as mic-sets in [35] for Mutually Independent Cells.Set size d c is chosen as 2d max + 2δ max (δ max is the maximum point translation distance, introduced in the next Section), so that moves in cells of one set (or color) have independent acceptance ratios: i.e., those moves can be performed in any sequential order -thus in parallel-without any change in the probability of acceptance.In practice, we have d c = 48 (d max = 16, δ max = 8).
We aim at leveraging the birth map stemming from the pre-computed energy maps V pos and V κ to focus the sampling on parts of the image where the density of objects is high.Contrary to [35], we do not use a quadtree structure of cells, but rather a regular grid of d c sized cells.In order to achieve an efficient sampling of cells (i.e., avoiding spending time in cells with no evidence of objects), weights are assigned to each cell to focus the sampling on the relevant ones.The sampling procedure goes as follows: 1.
pick a move type (Birth, Death, or Diffusion), 2. pick one set s with probability p(s), The process of cell selection (step 3) allows for limiting the number of cells processed at once on big images, hence limiting the computational cost and focusing the sampling on areas with high object density.We pick the following densities: denoting (by extension) d(s) and d(c) the density d integrated over all cells of set s and cell c, respectively.In practice, parameter n p = 1 ensures new points are sampled over the whole image with density d; higher values of n p allow for more parallel cells to be sampled each step at the cost of straying away from density d.

Jump Diffusion: Leveraging Gradients
The canonical RJMCMC algorithm also uses local transform moves to explore Y at a fixed number of points, picking one point to translate, rotate, and scale at random in y t .This is wildly inefficient as it does not consider the energy function U it is trying to minimize.Instead, we propose to use the energy gradient to explore the space more efficiently.This is the idea behind stochastic diffusion (or Langevin) dynamics [36,37].If y t and T t denote, respectively, the configuration and temperature at time t, then the configuration for the next step of the Markov chain is given by y t+1 = y t + dy t , with step size δ: At high temperatures, the Brownian motion from dw t ensures an exploration of space.At low temperatures, the Brownian motion is negligible, and the diffusion performs as a gradient descent.This allows fine-tuning the configuration at low temperatures while considering the whole energy model (contrary to the truncated energy used for birth maps in Section 2.4.1).
The diffusion alone allows for the exploration of space Y locally around y t at a fixed dimension (fixed number of points).To explore (or jump) across dimensions, we make use of the Birth and Death moves as defined previously (see Section 2.4.1).
In practice, as parallelization requires some maximum displacement on points (see Section 2.4.2), the step dy t from y t to y t+1 is clipped; for each point y ∈ y t updated to y , we bound the i and j components by δ max (i.e., |y i − y i | ≤ δ max , and similarly for j).

Resulting Sampling Method
The resulting sampling method is outlined in Algorithm 1.The method requires the following inputs:
Here, we denote by C c (x) the restriction of configuration x to cells c. Please note that to compute U for iteration of the loop on t, and the energy maps V pos , V κ are only computed once as stated in Section 2.3.3.
Algorithm 1: Sampling method.Input: x 0 , X, θ, T 0 , n s , α 1: for t = 0, . . ., n s − 1 do Pick diffusion with probability 0.8, else jump 3: Pick mic-set s with probability p(s) Keep each c in s with probability p(c|s) to make s 5: x t+1 ← x t unvisited cells will keep the previous state for all c ∈ s do 7: if diffusion then 8: dw ∼ N (0, β) x ∼ Q c (x → •) remove a point at random or add one using the birth map T t compute Green ratio 15: C c (x t+1 ) ← C c (x ) with probability min(1, r) The above implements the sampling improvements described previously, namely: • sampling new points using the birth map (line 13), • sampling in parallel over the whole image (line 6), in practice using tensor batch dimensions, • using the energy gradient to perform Diffusion (line 9).

Estimating Model Parameters
In this section, we estimate the parameters θ, so that, for each image X, the associated ground truth configuration y GT matches the most compatible configuration, i.e.: Previous PP approaches for object detection often relied on trial-and-error parameter estimation or used linear programming [23,38].The latter method would generate a set of constraints (These constraints can be seen as a local reformulation of global constraint (26).)U(y − ) ≤ U(y GT ) where wrong configurations y − are generated by applying strong perturbations on the ground truth configuration y GT .However, this only estimates the weights w e and is prone to over-constraining when the ground truth is noisy or when considering too many constraints.Moreover, this method requires designing the procedure to generate the configurations y − , which we find has a great effect on the estimated parameters.

Maximum Likelihood Learning
To tackle the above limitations, we turn towards the Energy-Based Model (EBM) literature for new parameter estimation methods.The authors in [26] propose to learn EBM parameters by maximizing the likelihood for the data D: The negative log-likelihood is then given, for parameters θ n at step n, as: with β an inverse temperature parameter (from the Gibbs distribution) that does not affect the position of the minimum [26].The gradient of L NLL on θ n is shown to be where p(y|X i , While the integral y∈Y ∂U(y,X i ,θ n ) ∂θ n p(y|X i , θ n ) remains intractable, it can be approximated through Monte Carlo sampling, where y 0 , . . ., y N are drawn from the law defined by p(•|X i , θ n ), which yields:

Contrastive Divergence
The authors in [39,40] propose to use a single sample in their Contrastive Divergence (CD) method.This method also uses a few simulation steps for the Monte Carlo Markov chain (MCMC) to generate y − , starting from the desired answer y GT .
The general idea is to generate contrastive samples y − that follow the density derived from U(•, X, θ n ) at step n of the optimization.Then we proceed to update θ n to θ n+1 , by gradient descent, to minimize the energy of the valid sample y GT , while maximizing the energy of the contrastive sample y − (see Figure 7).Alternatively, we can augment the data and use positive samples y + = y GT + N (0, σ + ) to replace y GT as in [41].In the case of imperfect GT, this models the uncertainty over the labels while providing some data augmentation.
The loss to minimize is then: with γ > 0 the weight of regularization term R V .We introduce the regularization term to avoid an explosion of the per-point energy V(y, X, N y y , θ n ).To ensure a sparse weighting of energies (i.e.minimize the number of non-zero weights w e ) we can introduce a new regularization term as an L 1 norm on the vectors of weights [42]: The broad strokes of the estimation procedure for parameters θ are as follows: 1. Pick a pair (y GT , X) from data D.

5.
Update θ n to θ n+1 according to the gradient ∇L, with the Stochastic Gradient Descent (SGD) method [43], as illustrated in Figure 7.

6.
Loop back to step 1 until convergence.

Replay Buffer
As sampling contrastive samples y − can be time-consuming, we use the replay buffer introduced by [41].The replay buffer saves Markov chain results at the current optimization step to use for initialization in the next steps, thus saving computing time.This allows for reducing the long simulation time necessary to pass the burn-in period of the Markov Chain [44] and obtain samples y − .We use a sample from the law derived from U(•, X, θ n−1 ) to initialize the chain that samples the law derived from U(•, X, θ n ).The use of the replay buffer within one step n of the optimization loop goes as follows: 1.
With probability p B set configuration x 0 as a value of the replay buffer B picked at random.With probability 1 − p B (or if the buffer is empty) set configuration x 0 as a random configuration in Y.

3.
Use y − = x n s as a negative sample for the loss computation and update of θ n to θ n+1 .4.
This buffer allows running longer chains across epochs (i.e.obtaining higher-quality samples) while maintaining a low computational cost.

Parameters Estimation Algorithm
The final parameter estimation algorithm is as follows : At line 4 we select a temperature at which to run the Markov chain to sample the negative configuration y − (β 1 = 1, β 2 = 10 −2 , β 3 = 10 −4 ); higher temperatures allow sample diversity, while lower ones seek points of minimal energy within the current energy model.Scaling ς computed at line 6 allows rescaling the energy to a unit variance model in order to ensure consistency in sample diversity for a set temperature.It is computed over 1000 samples y ∼ U (Y ).Finally, for each temperature β and image X, we set a specific buffer B X,β .The parameter K is the number of steps the Markov chain runs for every gradient descent step.In practice, we set K = 0.02d 2 c so that the number of iterations scales with the number of pixels in a cell.Algorithm 2 we propose above provides a single procedure to estimate all parameters in θ rather than only the weights w e as done previously with linear programming in [23,38].On top of avoiding over-constraining by having no hard constraints, our method also removes the need to design a procedure to sample configurations y − as those are sampled from the current energy model U(•, X, θ n ).Finally, the replay buffer allows for qualitative samples while maintaining a limited computational burden.for all (y GT , X) in D do 4: select temperature to sample negative config.

5:
x 0 ∼ U (B X,β ) with probability p B , else U (Y ) retrieve relevant buffer compute energy scale 7: see Algorithm 18: Update θ n+1 with ∆θ n using SGD end for 14: end while

Papangelou Intensity as a Confidence Score
Classical CNN-based object-detection models for object detection (such as [8,45,46]) yield a confidence score s(y) ∈ R for each proposed object y in the image.This confidence score is often interpreted, for each detection, as proportional to the probability of proposed element y to be a true positive, s(y) ∝ p(y|X).Applying a score (or confidence) threshold t s gives a set of detections for which metrics such as precision and recall can be computed by matching the detections with the ground truth.This allows adapting the threshold according to the need for the application; i.e., some applications may require low false positives (high precision) while others require fewer missed detections (high recall).To assess the performance independently of the threshold selection, the Average Precision (AP) metric sums up the performance of a model as the area under the Precision-Recall curve.
Previous Point Process approaches [18,47,48] only compute simple metrics such as precision, recall, or F1 score for the configuration given by the sampling procedure, as no score is associated with each object detection.

Papangelou Intensity as Score
With our PP approach, we propose to introduce a scoring function, first to filter the detections given a confidence threshold and second to be able to compare our method to others using the widely used AP metric.Within the PP framework, the probability of one proposed point being an object of interest depends on the rest of the inferred configuration ŷ; thus the scoring function reflects it: s(y| ŷ \ {y}) ∝ p(y| ŷ \ {y}, X).From (3), we have that the Papangelou conditional intensity is proportional to the probability of finding a point y ∈ y in a small neighborhood dy knowing the rest of the configuration y \ {y}.We propose to use the Papangelou conditional intensity as a score : 2.6.2.Pruning Sequence However, the dependency of the score on the current configuration yields a complication while computing the AP: when applying a threshold t s to prune the configuration y into y ⊂ y, for any y ∈ y , the score s(y|y \ {y}) may differ from s(y|y \ {y}).With a score of the form s(y) that only depends on y and the image X-such as those from classical CNN models-the score from one object after pruning is unchanged.
In the PP case, we compute the scores by sequentially removing the lowest scoring point until none is left; i.e., we build a sequence of configurations y 1 ⊃ y 2 . . .y n( ŷ)−1 ⊃ y n( ŷ) ⊃ ∅, with y 1 = ŷ, for n = 1, . . ., | ŷ|: Equation ( 34) provides a pruning order y 1 , . . ., y n( ŷ) of points in ŷ.This ordering allows plotting the precision and recall curve.Indeed, to trace a Precision-Recall curve, one only requires the sequence of (Recall(t s ), Precision(t s )) pairs, which are obtained by sequentially pruning the lowest scoring points.Equation ( 35) provides a score for each point y n .

Contrastive Divergence Loss and Papangelou Intensity
On the one hand, the energy model is trained by minimizing the loss function in (31) derived from the likelihood maximization of the parameters regarding the annotated data.On the other, we evaluate the performance of the inferred configuration with the scoring method in (35) sourced from the Papangelou intensity.Here, we show that while the two are derived differently, the minimization of the loss function leads to good properties on the score function.
Here, we consider a simplified loss with only the two energy terms (as γ 0).Denoting the energy change induced by the move from configuration y to x as ∆U(y → x) = U(x) − U(y), we have : Similarly, the Papangelou intensity can be rewritten as such: Single Point Addition Thus, for a simple negative sample y − = y + ∪ {u} in which we add a non-valid point u to y + , we have: This leads to the expected behavior: minimizing the loss L leads to minimizing the score of non-valid point u.The same stands for the removal of a valid point y ∈ y + and maximizing its score.

Arbitrary Sequence of Moves
This is also valid for the generic case where y − is generated from an arbitrary sequence of additions or removal of points (A translation/rotation/scaling can be viewed as removal and addition.)from y + .This defines a sequence (y k ) k=0,...,n of n configurations as: with y 0 = y + , y − = y n , and y k elements of either S × M or y + .Without loss of generality, we can reorder the sequence to match the pruning order defined in (34).The energy change for one move is given as: As we have (by definition) ∆U(x → x ) = ∆U(x → x ) + ∆U(x → x ), the loss is given as: ) − ∑ ).
By ordering the y k , y k to match the pruning order in (34) each λ(y k ; . . . ) can be matched to their respective score: • (41)(a) corresponds to non-valid points added to y + , their score is minimized as the loss is decreased; • (41)(b) corresponds to valid points removed from y + .Their score is increased as the loss is minimized.
With this, we showed that minimization of the loss at a configuration level leads to the expected results on object scores.
While this score function allows taking into account the interaction between objects that the PP model introduces, the decomposable nature of the PP (see ( 5)) allows for the interpretability of the results as demonstrated in Section 3.4.

Method Summary
In this section, we proposed a model for object detection that combines CNN architectures with a Point Process (PP) framework.Before applying our approach to real data, we shall summarize its function here.Considering configurations of objects cy, image X, and parameters θ, we look to build an energy model U(y, X, θ) such that the minimum energy point of it corresponds to the ground truth configuration for that image y GT .As such, we build U to be a combination of multiple energy terms (Equation ( 5)), with prior terms that encode the interaction model of our objects (where high energies repel unwanted configurations and lower ones favor others), and data terms that we build from the features from a CNN (Section 2.3.1).We refer the reader to Figure 5 for an overall view of the model pipeline.Within this framework, we can combine the information from the image with the prior knowledge of the configurations.
To estimate the parameters θ, we propose to use Contrastive Divergence.In short, we initialize the parameters at random and alternate between sampling low-energy configurations with the current parameters-that are nowhere close to the ground truth at the beginning of the procedure-and updating the parameters to increase the energy of these samples while maintaining low energy for the ground truth (Section 2.5).This allows us to estimate any differentiable parameters in the model, while previous methods would often become over-constrained and could only estimate linear parameters.Now, given parameters θ are set, we want to estimate the configuration of an unseen image X.This is achieved by sampling the configuration y * that minimizes the energy through a Jump Diffusion mechanism.Alternatively, proposing to add or remove points in the configuration guided by the energy maps the CNN provides (Section 2.4.1) and lowering the energy locally at a fixed number of points with gradient descent (Section 2.4.3).The latter can be applied in parallel over the image while focusing on patches of the image with higher expected object density (Section 2.4.2).
Lastly, we propose a metric based on the Papangelou intensity that allows us to compare our approach to others and takes into account the object interactions in the likelihood of detection (Section 2.6).

Comparison to PP Based on Contrast Measures
To demonstrate the need for learned likelihood measures for our data, we evaluate a few contrast measures found in the literature [20,24] in the task of classification of objects and non-objects (Figure 8).Table 1 formulates these measures by denoting µ y , σ 2 y the empirical mean and variance of pixels inside shape y; µ s , σ 2 s mean and variance on shape contour s; n y the normal vector of contour s, and ∇X the image gradient.
To assert the viability of such measures for these data, we compute a score s(y) = exp(−V c (y) for proposed detections y (both ground truth and randomly scattered rectangles).The true objects represent 1% of this experiment's dataset (A higher proportion of true objects would allow high average precision for useless estimators predicting a constant value.).Here we consider around 32k object proposals (i.e., 313 true objects, and 31, 300 non-objects).A well-suited measure s(y) should allow classifying easily between ground truth objects and random, non-valid objects.We plot Precision-Recall curves for several of those contrast measures in Figure 8 both for a sample image of DOTA 0.5 and a synthetic highly contrasted image.The Average Prevision (AP) values are reported in Table 1.We compare with the proposed detection energy given by V pos + V a + V b + V α .The proposed measure, based on CNN output, outperforms classical contrast-based measures, as the CNN has learned to differentiate between relevant and irrelevant contrasted objects (Irrelevant objects include, for instance, AC units-a highly contrasted, car-sized object-or some pavement features.).

Models
In this article, we show results on two CNN-based models and two PP models.A Python implementation of our models is available at https://github.com/Ayana-Inria/PP-EBM(accessed on 31 January 2024).
• CNN-LocalMax.: Naive detection from the CNN backbone (used in the PP models); we find objects through local maxima in the output probability maps (or local minima in the energy maps).
• CNN-PP m : PP model with minimal inferred parameters: Pre-set priors (i.e., with manually set parameters), only the weight vector w e are estimated with the CD method.The values of manually set parameters are detailed in Table 2.
• CNN-PP w : PP model with the whole parameter vector θ (which encompasses the weights w e and priors' parameters) estimated with the CD method.
• BBA-Vec.and YOLOV5-OBB: Lastly, we compare all our models above with BBA-Vec.from [8] and YOLOV5-OBB from [46].These models are trained on the same data as the above-mentioned models.Positive and negative samples represent, respectively, 1% and 99% of the samples for which the metric is computed.Right: Precision-Recall curves for the measures proposed in Table 1; blue: t-test [20], orange: image gradient [24], red (our data term), green: constant value.

Table 2.
Values for manually set parameters.Some energy terms such as V a have no parameters.

Energy Term Parameter Value
V pos t pos 0

Results on Remote-Sensing Data
The models are evaluated on a test split of the DOTA 0.5 dataset (for quantitative and qualitative results) and on ADS (for qualitative results as these data are not annotated).Metrics are computed by matching the object proposals and ground truth using the IOU (Intersection Over Union) with a 0.25 threshold.The Average Precision (AP) metrics (which are threshold independent) are shown in Table 3 and are derived from Precision-Recall (PR) curves shown in Figure 9.We show detection results in Figure 10 3. Table 3.Average Precision (AP) values.AP 0 and AP N correspond, respectively, to the AP on the DOTA 0.5 and DOTA 0.5 +noise datasets.Precision-recall curves from which the AP is derived are shown in Figure 9.

Computational Complexity
Our two PP-based methods-running on an Nvidia Quadro RTX 8000 Graphics Processing Unit-execute in 300 s on average, for 16 k parallel iterations (equivalent to 77 k sequential iterations) for one image.With an efficient implementation, considering a density of 1.9 × 10 −3 (95th percentile of the observed object densities (i.e., 95% of observed object densities are below this value.)),we estimate the cost of one iteration to be around 5 operations per pixel per iteration, thus around 4 × 10 5 operations per pixel in total.We show the derivation of those values in Appendix B. As a point of comparison, Transformer models for image classification range from 5 × 10 6 to 1.8 × 10 7 operations (see [49]).The complexity could be greatly reduced by reducing the number of iterations of the Point Process sampler at the cost of straying away from the proper convergence properties.

Results Interpretability
Due to the decomposition of the total energy into energy terms introduced in ( 5), the object score can be decomposed similarly: with s e (y|y \ {y}) = exp(w e ∆V e (y → y \ {y})), the Papangelou intensity obtained by considering the single energy term e.This allows viewing the contribution of each component.Moreover, we propose grouping these contributions into the data and prior contributions to, respectively, obtain s data and s prior such that the final score is a product of the two: s(y|y \ {y}) = s data (y)s prior (y|y \ {y}).
In Figure 12, we illustrate how the two components of the score s can help analyze the results.Here we manually cluster the results into 4 basic categories: green objects correspond to detection with high prior and data scores (high s, top 25%), while blue detections have a higher data contribution (s prior < s data ).The yellow detections correspond to objects with lower data scores (s prior > s data ), often located in ambiguous locations.
Purple objects have a low total score s (i.e., low confidence, bottom 15%).These detections are often pruned out when using a score threshold (as used in Figures 10 and 11).This allows for identifying where the model is confident, either due to the data from the image or the interactions with nearby objects.One can access even more detailed information by looking into the individual scores s e for each energy term V e .

Quantitative and Qualitative Results
In Figure 10, we observe that despite noisy ground truth labels-a potential cause for over-constraining issues when using linear programming (see Section 2.5)-our models infer more regular configurations that align better with the physical constraints of the objects of interest.Sample 1 in Figure 10 BBA-Vec.struggles to properly align and detect all the objects.In sample 2, model YOLOV5-OBB shows patches of objects with no overlap regularization.Meanwhile, our model outputs regularized configurations even in these dense areas.Figure 10 (line 3) and Figure 11 showcase the robustness of our PP-based models when tested on noisy data or areas of limited information, such as shaded areas.The inference results on the ADS dataset, obtained with models trained on the DOTA 0.5 training set, highlight the model's ability to generalize to unseen, albeit similar, data.
Comparison between CNN-PP m and CNN-PP w against CNN-LocalMax.in Figure 10 and Table 3 reveals the improvement brought by the PP-based approach, as the performance of the CNN alone (CNN-LocalMax.) falls short of our proposed method especially when confronted with noisy data.
Lastly, the comparable performances of CNN-PP m and CNN-PP w , both qualitatively and quantitatively, highlight the efficacy of the CD method in inferring model parameters.This approach significantly reduces the need for manual parameter tuning (more specifically, prior parameters), where traditional PP methods, such as linear programming, fall short.

Limitations
Our method, while promising, faces several limitations that warrant consideration.First, the computational complexity involved in sampling a PP can be significant, particularly with dense configurations of objects.To address this, we could explore approximate sampling methods, although this might come at the expense of result accuracy, for instance, by implementing faster annealing techniques or discretizing the state space [48].
Second, while the model we present is tailored to a specific type of object, the underlying tools are adaptable to various object types.First, the PP framework allows for a wide diversity of objects to be modeled, thanks to the versatility of marks.Moreover, the energy formulation of our model allows for easy integration or removal of interaction models and priors as simple energy functions.For instance, we could incorporate classification as a mark in the Point Process model, utilizing features from the CNN or a prior law based on the geometric features of the objects (e.g., distinguishing trucks from cars based on length).
Lastly, our method relies on annotated data for training.However, energy models offer compositional flexibility; one potential avenue is to train the data terms of the model on annotated images while utilizing synthetic datasets (devoid of visual components) for prior terms training.This approach could help mitigate the reliance on annotated data and improve generalization to diverse scenarios.

Conclusions
In this paper, we present a method that integrates interaction models into objectdetection algorithms, leveraging deep convolutional neural networks.We address challenges inherent in satellite imagery, such as atmospheric disturbances and limited resolution, by incorporating prior knowledge about object arrangements.
While CNN-based methods excel at pattern extraction, they struggle with learning object-to-object interactions without complex attention mechanisms.Conversely, Point Process methods offer a promising alternative, addressing both object likelihoods and arrangement coherence.However, existing approaches relying on contrast measures have limitations.
Our method combines CNN pattern extraction with the Point Process framework, simplifying contrast measure computations into efficient energy maps.By leveraging pre-computed potential maps, we enhance Point Process sampling methods, improving exploration efficiency within the state space.
We depart from conventional parameter estimation approaches for Point Process methods like linear programming and adopt Contrastive Divergence to estimate energy weights and internal parameters automatically.This improves accuracy and reduces reliance on manual parameter tuning.
To facilitate model comparison, we introduce a novel scoring function based on the Papangelou conditional intensity, providing a more comprehensive evaluation metric that takes into account interactions and allows for the explainability of results thanks to the easy decomposition of the energy model.
Experimental results on diverse datasets demonstrate our method's effectiveness, particularly in noisy environments.Our model's ability to generalize to new data showcases its robustness and potential for practical applications.
In conclusion, our method offers a promising approach to object detection, combining the strengths of CNNs with Point Process models.It shows potential for improving accuracy and efficiency in satellite imagery analysis or other applications where the priors are strong: e.g.object tracking (priors on dynamics) or road extraction (geometrical priors).

Appendix C. FCN for Small Objects in Large Images
For our application, we need a Fully Convolutional Network (FCN) [52] capable of responding to small objects in large images.Definition A1.A Fully Convolutional Network (FCN) is a type of CNN architecture making use of locally connected layers such as convolution pooling or upsampling [53].
A straightforward approach to learning a position probability map Z pos would be to directly infer a heatmap of centers, i.e., obtained by applying a Gaussian blur on a binary map of object centers (Figure A1b).However, these maps are sparse (non-zeros probabilities are under-represented), and at inference, we observe some connectivity added in between blobs, making the separation of instances more difficult (see Figure A1b).
We circumvent this problem by learning a proxy task: The model is first trained to infer a map of 2D unit-length vectors H ∇ that point towards the closest instance center (Figure A1c) similar to [54].We then apply the divergence operator, as object centers now correspond to convergence points in this vector field [55] (Figure A1d): The backbone architecture corresponds to a Unet [28], which outputs for an image X a tensor F(X) with N F features per pixel.From that, feature representation is extracted Z pos and the Z κ m ∀m ∈ {a, b, α}.
In our experiments, we train the Unet to minimize the following cost function for position term [55] and mark terms:

Figure 2 .
Figure 2. Point Processes derived from simple energies; (a): V a (y) ∝ 1, (b): V b (y) ∝ y i , (c): V c (y|N y y ) ∝ min y ∈N y y d(y, y ), (d): V d (y|N y y ) ∝ V b + V c .The horizontal axis corresponds to coordinate i and vertical to j.

( a )
Energy on discrete marks (b) Energy on continuous marks

3 .Figure 6 .
Figure 6.Space partitioning, where each color corresponds to a different set s.Each cell c z s (square of size d c ), is the z th cell of set s.For every point y, we display its interaction radius d max as a semi-transparent circle.

else 12 :
Q c ← Q c,B with probability 0.5 else Q c,D pick birth or death 13:

θn 1 Figure 7 .
Figure 7. Effect of training the energy with contrastive samples generated from the current θ n , amplified for illustrative purposes, as the gradient step that updates θ n is small (Figure adapted from [26]).

Figure 8 .
Figure 8. Comparing contrast measures on synthetic (first line) and real data (second line).Left:Positive and negative samples represent, respectively, 1% and 99% of the samples for which the metric is computed.Right: Precision-Recall curves for the measures proposed in Table1; blue: t-test[20], orange: image gradient[24], red (our data term), green: constant value.
(lines 1-2 for DOTA 0.5 and line 3 for DOTA 0.5 +noise), with a score threshold set for each model to the one maximizing the F1 score.The ADS data inference results in Figure 11 are obtained with the models 583 trained on the DOTA 0.5 training set.

Figure 9 .
Figure 9. Precision-Recall (PR) curves for the models listed in Table3.

Figure 10 .Figure 11 .
Figure 10.Samples of detection on the test dataset.The score threshold (to not display low-score objects) is set to maximize the F1 score for each model.The first two lines correspond to samples from DOTA 0.5 while the third one corresponds to DOTA 0.5 +noise.
(a) Inferred configuration (b) Color correspondence

Figure 12 .
Figure 12.Inferred configuration on an ADS data sample (a), colored according to their prior/data scores plotted in (b): yellow s prior > s data ; blue s prior < s data ; purple low s; green high s.Each point in (b) corresponds to a detection in s data ,s prior space (log values scaled to [0, 1]).

Table 1 .
Various contrast measures from literature and Average Precision (AP) computed on the DOTA image (AP real ) or synthetic image (AP synthetic ) from Figure8.

Table A1 .
Number of theoretical operations depending on object density.
(a)Values of λ reported in TableA2.

Table A2 .
Values for density lambda .