Stochastic Optimal Control Models for Online Stores

We present a model for the optimal design of an online auction/store by a seller. The framework we use is a stochastic optimal control problem. In our setting, the seller wishes to maximize her average wealth level, where she can control her price per unit via her reputation level. The corresponding Hamilton-Jacobi-Bellman equation is analyzed for an introductory case. We then turn to an empirically justified model, and present introductory analysis. In both cases, pulsing advertising strategies are recovered for resource allocation. Further numerical and functional analysis will appear shortly.


Introduction
Auctions are a natural way to assign resources to selfish agents who compete for them. Examples are diverse and include wireless spectrum, art, electronic equipment, songs, edges of a publicly known network [8,27], and other digital goods [15]. We refer the reader to the monograph by Milgrom [23] for an overview of auction theory. Online auctions have become established as electronic mechanisms in commercial transactions. Over a very short time, the Internet has become a formidable tool for connecting buyers and sellers separated by large physical distances.
In this electronic arena, online sales have increased by orders of magnitude. Online auctions, as a tool for commercial transactions, have been established and used to maximize the auctioneer's profit [1,6,12,18,34,19,13,7,11]. On the other hand, the growth of the online commercial environment has come at the price of anonymous transactions, in which buyers have less recourse in protesting bad or unfair service. Online auctions are substantially different from traditional offline auctions [9,33,24,21], and accordingly their bidders behave differently [4]. Feedback forums at online auction sites such as Amazon, eBay, and Yahoo! make it possible to rank other agents, to get informed, and to be ranked. Naturally, the online comment form has become vital in choosing which seller to purchase from. If the seller's reputation is damaged via negative feedback, she may expect a smaller sales rate. Reputation should simply mean that the higher the reputation of an agent is, the more confident that agent is. The notion of reputation helps agents, or the system designer, to optimally make their decisions, or to optimally design the mechanism, respectively [16,10]. The effect of reputation on market outcomes in online settings has been studied [3,31,30], as has the optimal bidding strategy in sequential auctions [2].
In an online store, if the agent has profited greatly from dishonest transactions, or even from decreased spending of resources on (a) advertising or (b) following up on sales to encourage higher reputation scores, this immediate increase in wealth may offset the long-term damage from lower reputation. With this in mind, one considers the existence of an optimal long-term strategy. Using such a strategy, the seller can maximize her average expected wealth at a time of her choosing. Our work approximates the actions of a seller as being continuous. Such an approximation serves as an entry point and has the benefit of closed-form solutions in some cases and the power of functional and numerical analysis in others. The results we present are encouraging and intuitively satisfying, and suggest more study in both discrete and continuous settings.
We note that our version of reputation evolution follows the standard Nerlove-Arrow construction [26], first postulated in 1962 and then extended to a stochastic setting by Sethi [32] and others. In Section 2, we propose an optimal finite horizon strategy, based on certain modeling assumptions, for a seller to maximize her final average wealth. In this case, we model reputation, or goodwill, as a geometric Brownian motion. This process is of course familiar to those who work in quantitative finance, and the reader can see that it does indeed fall within a Nerlove-Arrow construction [26] if the adapted control, the advertising rate, is taken to be proportional to the reputation level. The resulting control is of switching, or pulsing, type and is familiar to those who work on optimal advertising strategies. In Section 3, in our second model, we propose a finite horizon strategy that is based on the empirical work of Mink and Seifert [25], and we use for our reputation evolution the model proposed by Rao [29] and used by many others in optimal advertising models, such as Raman's recent work, Boundary Value Problems in the Stochastic Optimal Control of Advertising [28].

Implicit Resource Allocation Mechanism: Finite Time Horizon
For the probability space (Ω, ℱ, P), define B as the standard Brownian motion that lives on this space. Take the current state (W, R, µ) as the wealth, the reputation, and a measure of the resources spent on advertising or promotion, respectively. In this scheme, R is a positive number reflecting customer satisfaction. By choosing to shift up to 100ε percent, ε ≪ 1, of her resources from promotion to processing and back, the seller can influence her wealth and reputation levels via µ. Positive µ corresponds to a promotional state; µ < 0 corresponds to a processing state. The evolution of wealth and reputation is modeled over an interval [t, T], where 0 ≤ t ≤ T, by a two-state Markov process (W, R). The general model that we propose is to solve for u(W, R, 0), where ρ > 0 is the discount rate. This control problem corresponds to a store with (a) no salvage value upon closing or transfer, see [28], and (b) no switching costs proportional to µ², as we have the a priori bound |µ| ≤ ε ≪ 1.
Our approximation is to assume a large inventory with an almost constant number of units sold per unit time. Under this assumption, we consider the growth rate to be the product of two factors, G(R, µ) = h(R) r(µ). This product, the revenue per unit time, is the price per unit h(R) multiplied by a resource rate factor r(µ) = 1 − µ, the number of units processed per unit time. A factor r < 1 (µ > 0) represents the diversion of resources to advertising in return for a larger reputation score and a larger potential sales price per unit in the future. If the seller ignores her advertising duties and diverts resources to processing, i.e. if µ < 0, then r > 1 and we have an immediate revenue increase in return for a lower reputation score and a lower future sales price. It is implicitly assumed that this immediate loss (gain) can be attributed to spending (absorbing) an extra 100µ percent of time/resources on raising reputation scores (processing). A balance is sought between these two competing factors h and r.
Finally, we model the evolution of the seller's reputation as a geometric Brownian motion. This is done to reflect the asset-like nature of goodwill, and is only one of many approaches. In the next section, we propose a related control problem that uses the stochastic Nerlove-Arrow model [26]. This model has been used with success in the analysis of optimally planning advertising campaigns, and the resulting control is a pulsing, or switching, strategy that is similar to the one found using our geometric Brownian dynamics for goodwill (reputation). Of course, the reader might notice that the geometric Brownian motion model for the controlled reputation mechanism is also a Nerlove-Arrow type dynamic [26], if the control is taken to be proportional to the current level of goodwill, and if the added stochasticity has a diffusion term σ(R) that is linearly proportional to reputation R, instead of just constant. With these modeling constraints, we can state the corresponding optimal control problem and its Hamilton-Jacobi-Bellman (HJB) equation; basic algebra then leads to a set of equations to be solved. For general functions h(·), we are committed to the theory of partial differential equations. In the case where h(·) takes on a special form, we obtain simple, closed-form results via separation of variables. For example, we consider a power law model in the next section.
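As a rough illustration of the mechanism described in this section, the sketch below simulates one path of wealth and reputation. It assumes dynamics that are consistent with the description but are not quoted from it: dW = (1 − µ) h(R) dt for wealth and dR = µR dt + σR dB for reputation. The function name simulate_store, the policy callback, and every parameter value are hypothetical.

```python
import math
import random


def simulate_store(h, policy, R0, T, sigma, steps=1000, seed=0):
    """Simulate one path of (W, R) on [0, T].

    Assumed dynamics (an illustration, not taken from the source):
        dW = (1 - mu) * h(R) dt
        dR = mu * R dt + sigma * R dB
    `policy(t, R)` returns the control mu, expected to lie in [-eps, eps].
    """
    rng = random.Random(seed)
    dt = T / steps
    W, R = 0.0, R0
    for k in range(steps):
        t = k * dt
        mu = policy(t, R)
        # Revenue accrues at the price h(R) times the processing factor 1 - mu.
        W += (1.0 - mu) * h(R) * dt
        # Log-normal update: exact for a control held constant over the step,
        # and it keeps R strictly positive.
        dB = rng.gauss(0.0, math.sqrt(dt))
        R *= math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * dB)
    return W, R


if __name__ == "__main__":
    # Hypothetical pulsing policy: promote early, process near the horizon.
    eps = 0.05
    pulsing = lambda t, R: eps if t < 1.0 else -eps
    print(simulate_store(lambda R: R ** 0.5, pulsing, R0=4.0, T=2.0, sigma=0.2))
```

Wealth is left undiscounted in this sketch; any discounting at rate ρ would enter through the objective rather than through the simulated path.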

Power Law Model for the Growth Rate
Consider the growth rate G(R, µ) = (1 − µ)R^γ. This case corresponds to an object that has no inherent value if the seller's reputation is zero. As h(0) = 0, we expect that lim_{R→0} v(R, t) = 0. We can in fact show this via the following lemma.
Proof. Since µ ≤ ε, standard SDE theory [17] implies that for all which proves the assertion of the lemma. This can also serve as an upper bound for v(R, t).
As is sometimes possible with optimal control problems in finance and operations research, we assume an Ansatz for the solution of (1). Separation of variables then leads to a solution-control candidate pair of the form
v(R, t) = e^{−ρt} ψ_{γ,ρ}(t) R^γ,    µ_t = ε · sgn(γ ψ_{γ,ρ}(t) − 1).
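To make the candidate pair concrete, the following sketch integrates an ordinary differential equation for ψ backward from the horizon and reads off the pulsing control. The equation used is a reconstruction under assumed dynamics dR = µR dt + σR dB and an assumed terminal condition ψ(T) = 0; it is not quoted from the source. Substituting the Ansatz into the HJB equation implied by those assumptions gives ψ'(t) = cψ − 1 − ε|γψ − 1| with c = ρ − (σ²/2)γ(γ − 1). The names solve_psi and last_switch_time are hypothetical.

```python
def solve_psi(gamma, rho, sigma, eps, T, steps=20000):
    """Integrate the assumed ODE for psi backward from psi(T) = 0.

    Assumed reduced equation (a reconstruction, not quoted from the source):
        psi'(t) = c * psi - 1 - eps * |gamma * psi - 1|,
        c = rho - 0.5 * sigma**2 * gamma * (gamma - 1).
    Returns a list of (t, psi, mu) tuples ordered by increasing t, where
    mu = eps * sgn(gamma * psi - 1) is the candidate control.
    """
    c = rho - 0.5 * sigma ** 2 * gamma * (gamma - 1.0)
    dt = T / steps
    psi = 0.0
    path = []
    for k in range(steps, -1, -1):
        t = k * dt
        s = gamma * psi - 1.0
        # At s == 0 either bound is optimal; -eps is an arbitrary tie-break.
        mu = eps if s > 0 else -eps
        path.append((t, psi, mu))
        # Explicit Euler step backward in time: psi(t - dt) ~ psi(t) - dt * psi'(t).
        psi -= dt * (c * psi - 1.0 - eps * abs(s))
    path.reverse()
    return path


def last_switch_time(path):
    """Return the latest grid time at which the control changes, or None."""
    for i in range(len(path) - 1, 0, -1):
        if path[i][2] != path[i - 1][2]:
            return path[i][0]
    return None


if __name__ == "__main__":
    path = solve_psi(gamma=0.5, rho=0.05, sigma=0.3, eps=0.1, T=5.0)
    print("control at t = 0:", path[0][2])
    print("switch time:", last_switch_time(path))
```

Because ψ vanishes at the horizon in this sketch, the control always ends in the processing state µ = −ε; whether an earlier promotional phase exists depends on whether γψ reaches 1 within the horizon.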
We now prove the following theorem.
Interestingly, this control is independent of the reputation level R, and depends only on the time remaining.
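One way to see this independence, under an assumed form of the HJB equation, is that the control enters the maximized term linearly. If that term is e^{−ρt}(1 − µ)h(R) + µR v_R, an assumption consistent with the candidate pair but not quoted from the source, then its maximizer over [−ε, ε] sits at a bound determined by the sign of R v_R − e^{−ρt}h(R). The sketch below implements that pointwise rule; the name pulsing_control and its signature are hypothetical.

```python
import math


def pulsing_control(R, t, v_R, h, rho, eps):
    """Maximize the mu-dependent part of an assumed Hamiltonian over [-eps, eps].

    Assumed term (not quoted from the source):
        exp(-rho * t) * (1 - mu) * h(R) + mu * R * v_R
    It is linear in mu with slope R * v_R - exp(-rho * t) * h(R),
    so the maximizer is eps * sgn(slope).
    """
    slope = R * v_R - math.exp(-rho * t) * h(R)
    if slope > 0:
        return eps
    if slope < 0:
        return -eps
    return 0.0  # any mu in [-eps, eps] is optimal when the slope vanishes


if __name__ == "__main__":
    # Power law check: with v = exp(-rho t) * psi * R**gamma the slope is
    # exp(-rho t) * R**gamma * (gamma * psi - 1), so R drops out of the sign.
    gamma, rho, eps, t, psi = 0.5, 0.1, 0.05, 0.3, 3.0
    for R in (0.5, 1.0, 4.0):
        v_R = gamma * math.exp(-rho * t) * psi * R ** (gamma - 1.0)
        print(R, pulsing_control(R, t, v_R, lambda r: r ** gamma, rho, eps))
```

For a general price function h the sign of the slope typically depends on R as well as t, which is why the time-only switching rule is special to the power law case.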
For the sake of completeness, we now verify for T − (1/ε) ln(1 + ε) ≤ t ≤ T that the solution of the partial differential equation coincides with the expectation of the stochastic integral. In fact, direct substitution leads to a quantity that is the same as the corresponding quantity in (4). A similar computation can be made for 0 ≤ t < T − ln(1 + ε)/ε. Also note that lim_{ε→0} v(R, t) = (T − t)R. This is expected, as the bound |µ| ≤ ε forces the control to vanish in this limit.
The study of bang-bang optimal control problems has a deep history, along with many approaches. A very useful and well-studied path can be found in the work of [5], among others. This method solves for the sign of the gradient of the value function, ∇v, using a Girsanov transformation [5], and the general solution follows. For a more detailed study, consult [20].
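The switching time T − ln(1 + ε)/ε and the limit (T − t)R quoted above can be checked against a closed form, under the assumption that they refer to the case γ = 1, ρ = 0 with reputation dynamics dR = µR dt + σR dB and v(R, t) = ψ(t)R. Under those assumptions, which are a reconstruction rather than a quotation, ψ solves ψ' = −1 − ε|ψ − 1| with ψ(T) = 0. The function names below are hypothetical.

```python
import math


def switch_time(T, eps):
    """Switching time T - ln(1 + eps) / eps stated in the text."""
    return T - math.log1p(eps) / eps


def psi_linear(t, T, eps):
    """Closed-form psi for the assumed case gamma = 1, rho = 0.

    Assumed ODE (a reconstruction): psi' = -1 - eps * |psi - 1|, psi(T) = 0.
    On [t*, T] the control is -eps and psi <= 1; before t* it is +eps.
    """
    t_star = switch_time(T, eps)
    if t >= t_star:
        # psi = (1 + eps) / eps * (1 - exp(-eps * (T - t)))
        return -(1.0 + eps) / eps * math.expm1(-eps * (T - t))
    # psi = (exp(eps * (t* - t)) - (1 - eps)) / eps
    return (math.expm1(eps * (t_star - t)) + eps) / eps


if __name__ == "__main__":
    T = 2.0
    for eps in (0.1, 0.01, 1e-6):
        t_star = switch_time(T, eps)
        print(eps, t_star, psi_linear(t_star, T, eps), psi_linear(0.3, T, eps))
```

In this sketch ψ equals 1 exactly at the switching time, which is where the sign of γψ − 1 changes, and ψ(t) approaches T − t as ε shrinks, in agreement with the stated limit.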
Also, the reader may notice a similarity between our approach and the iconic Merton portfolio problem [22]. In our model, we multiply the power law growth rate by the processing rate dependent on the control, and we bound the size of our control. Still, it should not be too surprising that the solution is obtained by the same method, separation of variables, and has the same power law dependence on the underlying asset, in our case reputation. Unfortunately, the general h(R) case does not admit the same solution form, whether we use a geometric Brownian motion or stochastic Nerlove-Arrow dynamics [26]; see Section 3.
For the sake of comparison, one could also analyze this problem as a single-variable problem: the decision of when to make a single switch. For completeness, we present this in the Appendix and show that the solution obtained via this method is the same as that of the HJB approach.

Explicit Resource Allocation Mechanism: Empirical Model of Mink and Seifert
In the previous model, we assumed an implicit mechanism for generosity. However, an explicit mechanism has been proposed in the novel work of [25]. There, the authors propose and empirically justify a growth rate of h(R) = A + C(1 − 1/ln(e + R)), where A relates to the inherent value of the object for sale and C is a parameter to be fitted. This is accomplished by obtaining data using an auction robot and then computing a single regression, which gives C = 2.50 in (3). It should be noted that, among the many economic papers on the effect of reputation on bidding and final sales price, the Mink-Seifert model is the first one we found to give an explicit relationship between reputation and price. The Mink-Seifert model also suggests a multiple regression formula in which other factors, such as shipping costs and whether a "buy-it-now" price is offered, are considered as well. In that case, C = 1.93, and the authors of [25] comment that this implies ". . . the correlation between a seller's revenues and her feedback score can be attributed to a large part to the fact that highly experienced sellers both have a higher feedback score and design the auction more favorably". In fact, they show that the coefficient attributed to shipping costs is larger than one, implying that customers put a high value on shipping when deciding on their bids, and that savvy agents take this into consideration. Finally, they posit that the horizon T does not affect the revenue stream as much as the shipping cost and reputation factors, and so we consider an arbitrary finite horizon model here.
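For a feel of the reputation premium in this model, the sketch below evaluates the price per unit h(R) = A + C(1 − 1/ln(e + R)) that the paper attributes to Mink and Seifert. The function name and the value A = 10 are hypothetical; C = 2.50 is the single-regression value reported above.

```python
import math


def mink_seifert_price(R, A, C):
    """Price per unit h(R) = A + C * (1 - 1 / ln(e + R)) for R >= 0."""
    if R < 0:
        raise ValueError("reputation R must be nonnegative")
    return A + C * (1.0 - 1.0 / math.log(math.e + R))


if __name__ == "__main__":
    A, C = 10.0, 2.50  # A is a made-up inherent value; C is the fitted value
    for R in (0, 10, 100, 1000, 10**6):
        print(R, round(mink_seifert_price(R, A, C), 4))
```

At R = 0 the premium vanishes and h equals A; as R grows, h increases toward A + C, but only logarithmically, so large reputation gains buy progressively smaller price increases.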
As an initial approach, we consider only the effects of reputation, and leave the more general model with shipping costs for future work. Also, we consider the reputation mechanism first suggested by the work of Nerlove and Arrow [26] and later generalized to stochastic settings by, for instance, [32,29,28]. The stochastic wealth growth rate involves a proportionality constant κ, first proposed in [26], and a parameter γ that represents the maximum premium earned over the inherent value A by the seller due to reputation. In the previous section, we took the processing rate to be 1 − µ, for µ ∈ [−ε, ε] (written r(µ) there and p(µ) here); we can now extend the processing rate p(µ) given expenditure. For a finite horizon, a company may be interested in capping its expenditure rate, and may siphon only a small amount away from the advertising budget in order to allocate that resource to processing instead. In general, we can have a rate p(µ) together with a maximum possible processing rate M.
Our previous control problem (2) returned a switching, or pulsing, type advertising campaign that was only time dependent. This made the subsequent analysis to verify the HJB solution easier. With the stochastic Nerlove-Arrow [26] dynamics for reputation and the Mink-Seifert [25] growth rate for sales, we expect that switching would depend on both reputation and time. To investigate this, we again consider (i) p(µ) = 1 − µ, and (ii) a seller who wishes to cap her advertising expenditure, but now as a total rate that is not necessarily proportional to her current reputation level. The corresponding HJB equation is a nonlinear boundary value problem. The candidate for optimal control is again a pulsing strategy, switching between the values ε and −ε.
Naturally, in the resulting boundary value problem, the strategy will switch depending on both the time and the current reputation state. This is different from the power law model, which led to a time-only dependent pulsing strategy. Of course, the solution of the HJB equation is only a candidate for the solution of the control problem, and it must be verified as in the previous section. The solution of this HJB equation, however, may very well be non-smooth, and so the notion of generalized solutions (see [14]) must be visited. Further analysis of this equation is forthcoming.

Conclusion and Future Work
In this work, we defined a general framework for the problem of selling goods online when buyer feedback factors into the sales rate. The framework was a stochastic optimal control problem, where the seller wishes to maximize her average wealth level at a fixed time of her choice. We presented a method to incorporate the optimal design of an online auction (online store) by a seller in the presence of reputation management. Then, we introduced and analyzed the corresponding Hamilton-Jacobi-Bellman equation. To obtain intuition, we first analyzed a model where the revenue per sale, or the price per unit, was dependent solely on reputation, and multiplied by a mark-down factor. Such a factor represented the loss per item the seller could expect for behaving generously today in return for a larger reputation score and larger potential revenue in the future. As an example, we considered a power law growth rate function, which again provided insight into the general solution method.
Subsequently, we shifted our attention to an empirically justified model, and proposed that a seller might optimally design her online store in accordance with the Mink-Seifert model, an explicit mechanism proposed in the novel work of [25]. There, the authors propose and empirically justify a price per unit of h(R) = A + C(1 − 1/ln(e + R)), where A relates to the inherent value of the object for sale and C is a parameter to be fitted that represents the maximum premium expected due to reputation over inherent value. This is accomplished by obtaining data using an auction robot and then computing a single regression, which gives C = 2.50. It should be noted that, among the many economic papers on the effect of reputation on bidding and final sales price, the Mink-Seifert model is the first one we found to give an explicit relationship between reputation and price.
We have incorporated the reputation mechanism first suggested by the work of Nerlove and Arrow [26] and later generalized to stochastic settings in [32,29,28]. These modeling considerations define a bivariate Revenue-Reputation Markov process. Naturally, the strategy is expected to switch depending on both the time and the current reputation state. This is different from the power law model, which led to a time-only dependent pulsing strategy. Further analysis of the resulting HJB equation, including numerics, is forthcoming.