How to Find the Key Participants in Crowdsourcing Design? Identifying Lead Users in the Online Context Using User-Contributed Content and Online Behavior Analysis

Abstract: Lead users are the most valuable innovation sources in crowdsourcing design; how to identify these users is a research hotspot in the field of design and management. Existing approaches to discover lead users in the context of the online community, such as the manual method and ordering algorithm, have some limitations, for instance, low coverage and accuracy. To address these deficiencies, this article proposes a method that applies text-mining techniques, analysis of user behavior, and contributed content to identify lead users. We suggest a three-step analytical approach: First, a criterion system to evaluate the user's leading-edge status is constructed. Second, we utilize a fuzzy analytical hierarchy process to assess the weighted value of each indicator and develop the reference sequence of the indicators. Third, grey relational analysis is employed to analyze the correlations between users' indicators and reference sequences, and lead users are recognized based on each user's correlation ranking. An empirical analysis is used to examine the effectiveness of the proposed method. The results reveal that the method has good precision and recall rate, can automatically process large-scale data, and has no strict requirements for respondents. Finally, the article discusses the limitations and provides possible directions for future research.


Introduction
In recent years, the flourishing of crowdsourcing has triggered a transformation in the field of product development [1]. More and more companies have realized that customers are crucial external innovation sources with whom long-term relationships should be established; some enterprises have started collaborating with customers to design/develop new products or services [2]. The advancement of Internet technologies has offered a stable channel for consumers to participate in design activities. Many firms, such as Lego, P&G, Haier, and Dell, have launched online platforms, i.e., crowdsourcing design communities (CDCs). A CDC encourages customers to contribute content (e.g., designs and ideas) via posting topics and messages, which is the main pathway for users to participate in enterprises' new product development (NPD) projects [3,4]. For instance, Dell initiated a crowdsourcing platform called Ideastorm for NPD. Dell posts its needs (e.g., bottleneck problems the company encounters during the process of NPD) on Ideastorm and invites users to contribute ideas, designs, experience, and knowledge to solve these issues. With the Ideastorm strategy, Dell's NPD productivity has increased by almost 50 percent; many of Dell's best-selling products come from Ideastorm [5]. The CDC extends the communication channel between companies and customers, providing a new approach for enterprises to continuously acquire new ideas and external knowledge [6]. Moreover, as a critical innovation force, users can exert significant effects on the success of NPD. In the research field of sustainable crowdsourcing design, scholars have acknowledged that companies should employ users with rich experience, extensive knowledge, and great skill concerning technologies and usage of various types of products as long-time partners to develop new products [7].
With these individuals' assistance, enterprises can effectively decrease research and development (R&D) costs, shorten development cycles, increase the probability of project success, and improve design efficiency [8]. Eric von Hippel defined these individuals as lead users: customers of a product/service whose current needs will become general in the marketplace in the future and who would benefit greatly from obtaining a solution to these needs [9]. These individuals may speed up product R&D and promote the sustainable development of enterprises. Lead users have two typical features: first, their needs represent the development trends of the product/service; second, they are keen to participate in design projects to translate their needs into new products/services [10]. Since lead users are the most valuable participants in sustainable crowdsourcing design activities, how to identify these individuals with high efficiency is an important research issue in the area of open innovation and information systems.
The CDC is the primary medium for enterprises to organize crowdsourcing design events. The operators may launch product development projects and design challenges within the community. Users can participate in their preferred activities and contribute content via posting feedback and messages to solve design-related issues. Additionally, the enterprise encourages users to interact with others in the community; many members often initiate topic discussions on technology, product improvement, usage experience, requirements, etc., and reply to other users' posts. Members' contributed content can fully reflect their capabilities, expertise, and active degree, which may assist the company managers in discovering critical users (e.g., lead users and opinion leaders) [11].
Research on the discovery of lead users in social media (e.g., social network sites and online communities) is still at the initial stages. In particular, strategies to identify these users in the context of online communities have not been sufficiently explored [12,13]. The existing literature primarily introduced two kinds of methods to discover lead users: manual screening and an ordering algorithm based on influence rank [11]. The main goal of the former method is to let the community members recommend lead users. It has multiple tools, such as surveying, interviewing, and discussing. Although the manual screening method has been extensively applied in the research area of product development, it has some limitations, such as low coverage, high cost, and intense subjectivity [9]. The latter method aims to identify the lead users by evaluating the frequency of content contributed by community members and their social influence. Such a method can be utilized to handle large sample sizes. However, its accuracy is relatively low. Most of the users identified by the ordering algorithm are opinion leaders who may have a poor understanding of technologies and usage of products [13]. To accurately and efficiently recognize the lead users in the context of the CDC, this work proposes an integrated method that combines user-contributed content analysis with online behavior (mainly contribution behavior) analysis. The content analysis is performed using text-mining technology, while the behavior analysis is implemented using the community's statistical tool for online user behavior. The main contributions of this study are as follows:
(1) We propose integrated criteria that measure individuals' expertise and active degree. (2) Text-mining techniques are applied to extract product-related, innovation-related, user demand-related, and other information from user-contributed content in the CDC. (3) A ranking system based on the fuzzy analytic hierarchy process (FAHP) and grey relational analysis (GRA) is developed to identify lead users. (4) We demonstrate the efficacy of our proposed methodology utilizing a case study of user behavior data from a well-known CDC in China.
To sum up, the critical contribution of this study is to propose an innovative lead user identification method. On the one hand, this method can help the traditional manual method quickly screen target users in the ordinary network environment (such as microblog, Facebook, and Twitter). On the other hand, it can quickly and automatically identify lead users in the specific network environment of a crowdsourcing innovation platform.
The remainder of this article is organized as follows. Section 2 reviews the relevant literature on lead user identification. Section 3 describes our proposed method in detail. Section 4 presents the case study. Section 5 discusses the experimental results. Finally, Section 6 draws the conclusions.

Crowdsourcing Design in the Context of Online Platforms
Companies that desire to perform sustainable crowdsourcing design should continuously motivate users to contribute knowledge, ideas, designs, and innovations. An approach for enterprises to collect these resources is to run a CDC. Such a community often starts as a consumer support platform (e.g., a brand community) where customers exchange information about the company's products (e.g., usage precautions and tips), and then evolves into a channel through which users can put forward suggestions on product improvement and develop extensions [14,15]. Companies may adopt some of the good ideas contributed by the customers and develop the products according to their needs [16]. Besides, enterprises often post the issues they meet in NPD and encourage users to contribute content to solve these problems. Sometimes users may provide unconventional but effective solutions. Although CDC has been widely used for implementing open innovation, previous studies have paid little attention to systematically exploring users' features and participation behavior in such a context. Therefore, more in-depth research is necessary.
Another method for firms to initiate sustainable crowdsourcing design is launching a design contest website [17,18]. Enterprises put their needs (for technology, products, e-commerce, etc.) on specific sections within the platform. Members can find the need that matches their innovation capability through the retrieval system [19]. The operators will choose the best designs/ideas as solutions for the needs and pay the members for their contributions [20]. Members will try their best to beat the opponents to get the reward. OpenIDEO hosted by IDEO, HOPE hosted by Haier, and Cuusoo hosted by LEGO are typical design contest website representatives [21].

Manual Method to Discover Lead User
Research on the lead user in product design and innovation is mainly focused on identifying consumers who contribute innovative ideas that are ahead of market preferences and trends [11,22]. In the offline context, manual screening of a significant number of potentially relevant customers is the primary method to evaluate and identify lead users [23]. Hippel et al. first probed the identification methodology and proposed a screening and pyramiding method to search lead users [9,24]. A representative sample or a predefined population is screened for users who satisfy a particular criterion via questionnaires to perform screening [9,13]. The examined sample should be sufficiently large to discover the real lead users. Pyramiding is the improvement of screening; it is a more targeted method that dramatically reduces screening efforts. To implement pyramiding, researchers must build the pyramid of expertise that contains three layers: the lead users, the users who have good knowledge of the product and can find the lead users, and the users who have an understanding of the product domain and may find the advanced experts [24]. The researchers may contact any users of these layers and follow the chain of user recommendations to find the next-level users. This method is effective since respondents with vital interests in specific topics tend to know the senior experts in the area [12].
Many scholars have further developed Hippel's strategy: Lüthje [25] empirically explored the features of innovation participants and considered that researchers should incorporate more indicators such as user influence, innovation ability, and forward-looking expectations to perform screening. Morrison et al. [26] applied leading-edge status to measure the users' level of expertise. Their research results revealed that applications' innovativeness is one of the most critical features of lead users. Tietz et al. [27] proposed a signaling method that utilized advertising tools to discover lead users; such an approach can broadcast the survey information to attract more target users. Brem and Bilgram [13] found in a sample of 24 lead user projects that screening, pyramiding, and signaling remained the most frequently applied search strategies to identify lead users. Hienerth and Lettl [28] explored the measurement of the lead user construct; they considered that social media, data mining, and modern search technologies might be employed to improve the effectiveness of the manual method. These works provide feasible manual methods for scholars to identify lead users in the offline context; however, such methods have some significant shortcomings: they are time-consuming, incur high search costs, have low sampling efficiency, are highly subjective, and cannot reach all the lead users in the user space [11,12,29]. Hence, the manual method should be improved to suit the online environment.

Ordering Algorithm to Identify Lead User
With the rapid development of information technologies, many scholars considered that monitoring social media may replace the manual method to collect the information of customers [22]. They put forward a new method to measure user influence based on the user interaction relationship of a community network from the perspective of user interaction characteristics [30]. Tang and Yang [31] proposed a user-rank algorithm that combined content and network analysis to discover influential members in online communities. Song et al. [32] developed an influence-rank algorithm to identify opinion leaders in blogospheres; their method adopted social networks among community members, which are not always practicable in some online platforms. Hajian and White [33] proposed an index (Magnitude Of Influence (MOI)) to quantify the influence of online social network users on their neighbors, further using the PageRank algorithm to weight the MOI through the influence ranking of neighbors to determine the final influence ranking of the user. Tuarob and Tucker [11] developed a matching algorithm that connected the relationships between lead users and product features to identify lead users with special interests in certain areas. Pajo et al. [12] proposed a classification model to subdivide the users; such a method can reasonably identify the characteristics of different users. To optimize the identification of lead users, scholars have explored the additional features of lead users in the context of online communities. For instance, most lead users are influential and active members in cyberspace. They often generate product-related and service-related content, and their opinions represent most users' perceptions [34,35]. 
Although the ordering algorithms suggested by the existing works have improved the efficiency of lead user discovery, they have some drawbacks: (1) the definition of lead users in most of these works relates to how the users' views propagate throughout the online platform. In contrast, the lead users in the innovation field should be individuals who have extensive knowledge and unknown demands. That is to say, most of the existing methods are developed to identify opinion leaders [11,13]. (2) Most approaches need network connectivity among members, which is not always available in communities [11]. Thus, the ordering algorithm should be further developed to analyze the characteristics of lead users from the perspectives of knowledge and demand.

Research Framework
The CDC is constructed based on the online forum that motivates users to generate content and interact with other members. Within the CDC, enterprises disclose the design-related problems they face through posts and encourage customers, fans, experts, etc. to contribute (by posting images, text, and videos in the community) to settle these design challenges [3]. Additionally, operators also encourage members to post their demands in the forum so that the companies can understand the customers' trends of demand development and initiate new projects of product development [13]. Hence, the CDC users may generate a large amount of content that can reflect their capabilities.

This work develops an identification method that considers multiple characteristics of lead users, including active degree, expertise, and quality of demands. The method contains three steps: first, we develop a criterion system based on previous literature to measure the features of potential lead users; second, text-mining techniques are utilized to collect user-contributed content and users' online behavior statistics from the CDC, and then we apply the FAHP to evaluate the weight of the indicators and establish the reference sequence of criteria; third, the GRA is employed to calculate the correlation between the candidate set (the potential users' indicator set) and the reference sequence, and the users are ranked based on their correlations. At last, the top-ranking users (the enterprises decide the scales) will be considered lead users. After these steps, a case study is performed to verify the efficiency of the proposed approach. Figure 1 shows the research framework.

The Criterion System for Lead User Identification
Since CDC users generate design-related content primarily through posting topics and feedback, the frequency of content contribution and the correlation between content and innovation may reflect the user's leading-edge status [2,12,25,26]. Additionally, some researchers suggested that the user's influence in the community may be a vital characteristic of the lead user [30,36]. However, from the analysis of online innovation platforms (e.g., CDCs and crowdsourcing websites), we noticed that many professional discussion topics posted by users who have extensive knowledge and great skill on technologies and innovation attract very little attention from other members. These users meet the criteria for lead user identification proposed by Hippel, but their visibility in the community is relatively low. Thus, we considered that social attributes (e.g., individual influence) are not the essential features of lead users.
Following previous studies [12,25,26], we employ characteristics of contribution behavior (e.g., contribution frequency) and correlations between user-contributed content and product, innovation, design, and technology as evaluation indicators to measure the individual's leading status. In particular, Guo et al. [3] considered that the ranking system of the online community, which is a kind of statistical tool, can well reflect the features of the users' online behavior. Hence, we apply this system to analyze users' contribution behavior.
Besides, text-mining and analysis techniques are utilized to evaluate the relationships between users' contributions and innovation.

The Indicators of Features of User's Contribution Behavior
Nowadays, most online communities have developed a ranking system that can reflect the member's reputation, active degree, and community influence by analyzing their online behavior such as posting, replying, likes, sharing, and consumption. The system gives the user a corresponding rank based on the statistics of individual behavior and assists the operator in managing the community. Table 1 shows the introduction of indicators from the standard ranking system.

Table 1. Introduction of indicators from the standard ranking system.

| Indicator | Introduction |
| --- | --- |
| Rank | The value of user rank. |
| Title | Virtual honor obtained by users when they reach a certain level. |
| Point | A behavioral credential that users obtain by using community, browsing, posting, purchasing goods, etc. |
| Contribution Value | Reflecting the depth of users' participation in online activities. |
| Virtual Currency | The rewards that users receive through contributing behavior can be used for virtual consumption. |

As shown in Table 1, contribution value reflects the depth of community members' participation in online events. In the context of the open innovation platform, this indicator also reflects the breadth and depth of users' expertise and usage experience. Point is calculated based on statistical information about the user's online behavior; it describes the frequency of posting, replying, etc., and reflects the individual's active degree. Rank is a comprehensive reflection of the user's contribution level in the community, reflecting the user's relative position in the member group. These three indicators can well reflect the user's leading-edge status. Therefore, the contribution value, point, and rank are applied to measure the behavior characteristics of users in this work.

The Indicators of Correlations between User-Contributed Content and Innovation
The projects of NPD initiated by the enterprise are mainly carried out around the structure, function, and appearance; hence, the online comments from users that contain these contents are often focused on by the product developers [37]. Users who contribute such content may likely include lead users. Huang et al. [38] considered that the users' highly positive emotional trust affects users' decision-making in virtual communities. Li [39] proposed an ordering algorithm that can be used to evaluate the effectiveness of user comments. He suggested that the individual's reputation, number of thumbs up, timeliness, effective length, words of product features (i.e., attributes), and emotional words may be applied to estimate the correlation between online comments and innovation. Therefore, following prior literature and combining the analysis results of open innovation communities, we utilize product features and issues, emotional words, effective length, and timeliness of comments as indicators to evaluate the user-contributed content. Table 2 shows the introduction and calculation basis of indicators of contribution behavior characteristics and correlations between user-contributed content and innovation.

Table 2. The indicators of criterion system for lead user identification.

| Standard Categories | Introduction | Calculation Basis (Indicators) |
| --- | --- | --- |
| Features of user's contribution behavior | These indicators can be employed to measure user's interaction level, contribution frequency, product usage, etc., which reflect the individual's active degree and experience. | Contribution value; point; rank |
| Correlations between user-contributed content and innovation | These indicators can be utilized to reflect the user's innovation capabilities, expertise, hierarchy, usage experience, etc. | Words of product features/attributes; words of product issues/evaluative lexis; emotional words; effective length of comments; timeliness of comments |

The quantitative methods for calculating the indicators of correlations between user-contributed content and innovation are as follows.

1. The calculation of the indicator of words of product features.
When users post their opinions (such as evaluation, demand, etc.) on products in the community, they often use words of product features to describe them. Attributes reflect the product's inherent characteristics, such as structure, appearance, etc.; most of these words are nouns. Therefore, we considered that when the comments contain product attributes, the comments may reflect product-related content (e.g., use experience, product problems, improvement suggestions, etc.). The more attribute words are included, the more information is transmitted, the more significant the auxiliary role for product development and improvement, and the higher the effective contribution level of users.
This study employs the single comment (i.e., a complete comment) users post as the analysis objects. A self-developed spider tool is used to collect users' comments from CDC, and we apply jiebaR to segment the collected Chinese texts into words and tag them with proper Part-of-Speech (PoS) tags (e.g., noun, verb, adverb, and adjective). After deleting the stopwords and punctuation characters, the comment is transformed into a word set $U_{1i} = (u_{11}, u_{12}, \ldots, u_{1m})$, containing $N_w$ words. The R programming language is utilized to match the words in $U_{1i}$ one by one with the words of product features in a lexicon $U_2$ developed by the Institute of Computing Technology, Chinese Academy of Sciences [40]. When a word is matched, the number of attributes of the comment is increased by one. We use $N_a$ to represent the number of attributes in a single comment.
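The matching step above can be sketched as follows. This is a minimal illustration in Python rather than the R/jiebaR pipeline used in this work; the function name, the tokens, and the small lexicon are hypothetical stand-ins, and it assumes that every occurrence of a matched word is counted.

```python
def count_feature_words(tokens, feature_lexicon):
    """Return N_a: the number of tokens in one comment found in the feature lexicon.

    Assumption: repeated occurrences of the same feature word are each counted.
    """
    lexicon = set(feature_lexicon)  # set membership keeps each lookup cheap
    return sum(1 for word in tokens if word in lexicon)


# Hypothetical comment, already segmented and stripped of stopwords (U_1i).
comment_tokens = ["motor", "performance", "good", "battery", "motor"]
# Hypothetical stand-in for the product-feature lexicon U_2.
feature_lexicon = {"motor", "performance", "battery", "screen"}

n_w = len(comment_tokens)                                   # N_w
n_a = count_feature_words(comment_tokens, feature_lexicon)  # N_a
print(n_w, n_a)
```

The same routine could be reused for any lexicon-based indicator by swapping in a different word list.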

2. The calculation of the indicator of words of product issues.
The words of product issues (i.e., evaluative lexis) are often used to describe consumers' intuitive perception of the product functions, appearance, and other attributes. These words often reflect users' demand expectations and attitudes towards the product. For instance, "too large" in "car fuel consumption is too large" is the user's intuitive feeling towards car fuel consumption. Therefore, when a comment contains words of product issues, the users may post content about the product use experience, product defects, and personal needs. The more words of product issues are included in the content, the more detailed the description of the product problem, and the more profound users' engagement.
The product issues are often described with adjectives and verbs, usually used with adverbs. For example, within the comment "the motor performance is not good," "motor" and "performance" are words of product features; the adverb "not" modifies the adjective "good," and they constitute the word of a product issue. We apply the R programming language to identify the comments' adverbs, adjectives, and verbs. When an adjective or a verb appears in $U_{1i}$, the number of words of product issues of the comment is increased by one. Additionally, when an adverb appears in $U_{1i}$ together with an adjective or a verb, the number of words of product issues is also increased by one. We use $N_i$ to represent the number of words of product issues in a single comment.
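A minimal Python sketch of this counting rule is given below. It is not the R implementation used in this work. The one-letter tags ("n", "v", "a", "d") are an assumed tagging scheme, and the sketch assumes that an adverb appears "together with" an adjective or verb when it immediately precedes one.

```python
def count_issue_words(tagged_tokens):
    """Return N_i for one comment given (word, pos_tag) pairs.

    Assumed tags: "a" = adjective, "v" = verb, "d" = adverb, "n" = noun.
    """
    n_i = 0
    for idx, (_, pos) in enumerate(tagged_tokens):
        if pos in ("a", "v"):
            # Every adjective or verb counts once.
            n_i += 1
        elif pos == "d":
            # Assumption: an adverb counts only when the next token is an
            # adjective or a verb, i.e., when it modifies that word.
            has_next = idx + 1 < len(tagged_tokens)
            if has_next and tagged_tokens[idx + 1][1] in ("a", "v"):
                n_i += 1
    return n_i


# "the motor performance is not good" after stopword removal (hypothetical tagging).
tagged = [("motor", "n"), ("performance", "n"), ("not", "d"), ("good", "a")]
print(count_issue_words(tagged))  # "not" and "good" are counted, giving 2
```

A different reading of "together with" (for example, any adverb in the same comment) would only change the condition in the `elif` branch.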

3. The calculation of the indicator of emotional words.
When users post their opinions on products in communities, they often express their emotional tendencies through their vocabulary. For instance, "perfect," "good," and "satisfied" express positive emotions, while "bad," "terrible," and "disappointed" express negative ones. The emotion expressed by the users on the functional attributes of the product can be regarded as an open test result for the product [41]. When users express positive emotions, it indicates that the product has satisfied their expectations in a particular aspect. There is no need to improve the product at the moment. However, when users express negative emotions, it indicates that specific attributes of the product have not met their expectations, and the enterprise should improve the product as soon as possible. Positive and negative emotions can reflect an individual's demand tendency and provide essential references for product improvement. Hence, when the comment contains emotional vocabulary, it may reflect the users' product evaluation and demand tendencies. Moreover, the more emotional words are included, the stronger the user's emotional trend is reflected.
We use the R programming language to match the words in $U_{1i}$ with the emotional words in the lexicon $U_3$ of ICTCLAS [40]. When a word is matched, the comment's count of emotional words is increased by one. We use $N_e$ to represent the number of emotional words in a single comment.

4. The calculation of the indicator of the effective length of comments.
In the Chinese language environment, the length of an online comment is usually quantified by the number of Chinese characters included in the comment. However, most online comments contain a large amount of meaningless content, and some of them include numerous characters that have nothing to do with innovation. Thus, we should apply the effective length of the comments to evaluate the leading-edge status of the users. In this work, the ratio of the number of emotional words, product features, and issues in the comment to the number of words in $U_{1i}$ is utilized as the quantized value of the effective length of the comment. Meanwhile, to reduce the deviation caused by the abnormal length (e.g., too long or too short) of comments, the logarithm is used to weaken the difference of the denominator, as shown in Equation (1):
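As a minimal sketch of this indicator, assume the ratio takes the form $(N_a + N_i + N_e)/\ln N_w$. The natural logarithm, the function name, and the guard for comments with fewer than two words are assumptions made for illustration; the exact form of Equation (1) may differ.

```python
import math


def effective_length(n_a, n_i, n_e, n_w):
    """Assumed effective length: relevant words over the log of the total words."""
    if n_w < 2:
        # ln(1) = 0 would divide by zero; treat such comments as carrying
        # no effective content (an assumption, not part of the source).
        return 0.0
    return (n_a + n_i + n_e) / math.log(n_w)


# Two hypothetical comments with the same 5 relevant words but 10 vs. 100 total words.
short_comment = effective_length(2, 2, 1, 10)
long_comment = effective_length(2, 2, 1, 100)
# A plain ratio would differ by a factor of 10; with the logarithm the
# difference shrinks to ln(100) / ln(10) = 2.
print(short_comment / long_comment)
```

The comparison illustrates why the logarithm is used: very long comments are penalized far less than under a plain ratio.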

5. The calculation of the indicator of timeliness of comments.
The timeliness of comments refers to the difference value between the time when the user posts a comment and when the researchers fetch the comment; the smaller the value, the higher the timeliness [39]. The more content the user posts in a period, the higher the user's participation and the higher the user's leading-edge status. For product innovation, the more time-sensitive comments reflect the newer needs of users, and the less likely they are to be discovered and resolved by competing companies.
Meng and Ding [42] suggested that the newer the comments, the higher the credibility. The influential users' effect would diminish with time [43]. Based on the results of these previous studies, we divide the comments into two groups: those posted within the last three months and those posted more than three months ago. The former are assigned a value of 2, while the latter are assigned a value of 1.
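The two-level scoring can be sketched as follows. Treating "three months" as 90 days is an assumption, as are the function name and the example dates.

```python
from datetime import date, timedelta


def timeliness_score(posted_on, fetched_on, window_days=90):
    """Return 2 for comments posted within the window before fetching, else 1.

    Assumption: "three months" is approximated by a 90-day window.
    """
    age = fetched_on - posted_on
    return 2 if age <= timedelta(days=window_days) else 1


fetch_day = date(2021, 6, 1)  # hypothetical day on which comments were fetched
print(timeliness_score(date(2021, 5, 1), fetch_day))   # recent comment
print(timeliness_score(date(2020, 12, 1), fetch_day))  # older comment
```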

The Ordering Algorithm of Evaluation Indicators
The identification of lead users is mainly achieved by ranking the leading-edge status of the users. The rules are as follows: the weight of each evaluation indicator is calculated by FAHP. Then, the optimal value of each indicator in the research sample is selected to form a reference sequence, and the correlation between each candidate user's indicator sequence and the reference sequence is calculated by GRA. The greater the degree of association, the higher the user's leading-edge status. Based on the research purpose and requirements, a certain relevance threshold can be set to distinguish between lead users and regular users.
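The ranking rule described above can be sketched in Python as follows. The sketch uses the conventional grey relational coefficient with a distinguishing coefficient of 0.5 and scales each indicator by its reference value; both choices, as well as the users, indicator values, and weights, are assumptions made for illustration and are not necessarily the processing adopted in this work.

```python
def grey_relational_grades(users, weights, rho=0.5):
    """Weighted grey relational grade of each user against the reference sequence.

    users: dict mapping a user name to a list of non-negative indicator values,
    where larger values mean a higher leading-edge status.
    """
    m = len(weights)
    # Reference sequence: the best (largest) value of each indicator in the sample.
    reference = [max(vals[i] for vals in users.values()) for i in range(m)]
    # Assumed scaling: divide by the reference so the reference becomes all ones.
    scaled = {
        name: [v / reference[i] if reference[i] else 1.0 for i, v in enumerate(vals)]
        for name, vals in users.items()
    }
    # Absolute deviation of every user indicator from the scaled reference.
    deltas = {name: [abs(1.0 - v) for v in vals] for name, vals in scaled.items()}
    flat = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(flat), max(flat)
    grades = {}
    for name, ds in deltas.items():
        if d_max == 0:
            grades[name] = sum(weights)  # every user coincides with the reference
            continue
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in ds]
        grades[name] = sum(w * c for w, c in zip(weights, coeffs))
    return grades


# Hypothetical users with three indicators and hypothetical weights.
users = {"u1": [10, 200, 3], "u2": [5, 100, 3], "u3": [1, 20, 1]}
weights = [0.5, 0.3, 0.2]
grades = grey_relational_grades(users, weights)
ranking = sorted(grades, key=grades.get, reverse=True)
print(ranking)
```

Users above a chosen grade threshold, or the top entries of `ranking`, would then be treated as lead user candidates.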

The Calculation of Indicator Weight Based on FAHP
FAHP is a research method widely applied in the analysis of and decision-making for complex systems [44]. It can simplify complex problems into ordered hierarchical structures. In this work, such a method is used to determine the weights of various indicators. The analysis steps are as follows:

Step 1: Construct a judgment matrix. The judgment matrix reflects individuals' thinking and judgment; it can be employed to collect the users' opinions on the weight of the indicators. Since the users of the CDC are the analysis objects of this work, the data that constitutes the judgment matrix mainly comes from the network survey of community members. The judgment matrix will be sent to the user's mailbox as a web questionnaire during the investigation process. Then, the members compare and score the importance of indicators according to their experiences and feelings. The scale applied in the matrix is the 0-0.5-1 standard; its descriptions are shown in Table 3.

Table 3. Scale descriptions.

| Scales | Definition | Introduction |
| --- | --- | --- |
| $a_{ij} = 1$ | Important | The indicator i is more important than the indicator j |
| $a_{ij} = 0.5$ | Equally important | The indicator i and indicator j are equally important |
| $a_{ij} = 0$ | Unimportant | The indicator j is more important than the indicator i |

$a_{ij}$ is the judgment value.
Step 2: Construct the fuzzy consistent matrix. After constructing the judgment matrix, we use the method proposed by Korvin and Kleyle [45] to transform it into a fuzzy consistent matrix. The rows and columns of the judgment matrix are first summed, and each element of the matrix is then transformed using these sums (Equations (2)-(4)).
Step 3: Calculate the indicator weights. A consistency test is performed on the fuzzy consistent matrix, and the weight w_i of each indicator a_i is then computed from the matrix (Equation (5)). In Equations (2)-(5), m represents the number of indicators. In addition, to improve the resolution of the sorting result, researchers often set A = (m − 1)/2. Finally, the average value of each indicator weight is obtained, giving W = (w_1, w_2, . . . , w_m).
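The weighting steps can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes the commonly used fuzzy consistent matrix formulas r_i = Σ_k a_ik, r_ij = (r_i − r_j)/(2m) + 0.5, and w_i = 1/m − 1/(2A) + (1/(mA)) Σ_k r_ik with A = (m − 1)/2, which may differ in detail from Equations (2)-(5). The function name `fahp_weights` and the example matrix are hypothetical, and the consistency test is not covered.

```python
def fahp_weights(judgment):
    """Derive indicator weights from a square 0-0.5-1 judgment matrix (m >= 2)."""
    m = len(judgment)
    row_sums = [sum(row) for row in judgment]  # r_i = sum_k a_ik
    # Fuzzy consistent matrix: r_ij = (r_i - r_j) / (2m) + 0.5
    consistent = [
        [(row_sums[i] - row_sums[j]) / (2 * m) + 0.5 for j in range(m)]
        for i in range(m)
    ]
    big_a = (m - 1) / 2  # A = (m - 1) / 2, the setting mentioned in the text
    # w_i = 1/m - 1/(2A) + (1/(mA)) * sum_k r_ik
    return [1 / m - 1 / (2 * big_a) + sum(row) / (m * big_a) for row in consistent]


# Hypothetical judgment matrix for three indicators (diagonal entries are 0.5).
example = [
    [0.5, 1.0, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.0, 0.5],
]
weights = fahp_weights(example)  # approximately [0.5, 0.333, 0.167]
```

Under these formulas the weights always sum to 1. When several respondents each return a judgment matrix, one option is to compute the weights per respondent and average them element-wise.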

The Ranking of User's Leading-Edge Status Based on GRA
GRA is a multi-factor statistical analysis method [46]. Its basic idea is to judge how closely multiple sequences track a reference sequence and then describe the size, strength, and order of the relationships among factors according to the degree of association. For lead users, the greater the value of each indicator, the higher the leading-edge status. Compared with traditional statistical analysis methods, GRA has the following main advantages: it analyzes the development trend of the research objects, so it places no heavy demands on sample size and does not require the data to follow a specific distribution; its computational cost is relatively small; and its results agree with those of qualitative analysis [47]. The analysis steps are as follows:
Step 1: Dimensionless processing of the data. X_k(i) and Y(i) represent the user's indicator sequence and the reference sequence, respectively, where k = 1, 2, . . . , n and i = 1, 2, . . . , m; n is the number of users to be ranked, and m is the number of indicators. Because different indicators often have different dimensions and orders of magnitude, they cannot be compared directly, and normalization is required. In GRA research, scholars often use the min-max method for dimensionless processing. However, the data in our criterion system are heterogeneous and the indicators differ enormously in magnitude, so the min-max method is not applicable to our work. Hence, to eliminate singular data and bring all indicators to the same order of magnitude so that they are comparable and suitable for comprehensive evaluation, we apply the averaging method to the original data. Here, X_k(i) is the quantized value of the i-th indicator of the k-th user; x_i is the average value of the i-th indicator over the candidate users' sequences; and y_i is the quantized value of the i-th indicator of the reference sequence. Users whose indicator values are identical to the corresponding values in the reference sequence need to be eliminated to avoid invalid results.
Step 3: Calculation of the grey correlation. The weighted method is used to calculate the grey correlation, as shown in Equation (9). In the equation, w_i is the indicator weight obtained by FAHP, with Σ w_i = 1. Sorting the users by correlation then gives the ranking of their leading-edge status. The relevance threshold is set dynamically according to the semantic environment and the specific purpose of the research, and it is used to distinguish between lead users and regular users.
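The ranking procedure can be illustrated with the following Python sketch. It is an approximation under stated assumptions rather than the authors' code. Each indicator is divided by its mean over the candidate users (the averaging method). The grey relational coefficient takes the standard form (Δ_min + ρ Δ_max)/(Δ + ρ Δ_max), with a distinguishing coefficient ρ = 0.5 that the text does not specify. The step that eliminates users identical to the reference sequence is omitted. The function name `grey_relational_grades` and the sample data are hypothetical.

```python
def grey_relational_grades(users, reference, weights, rho=0.5):
    """Return one weighted grey relational grade per user (higher = closer to the reference)."""
    n, m = len(users), len(reference)
    # Averaging method: divide every indicator by its mean over the candidate users.
    means = [sum(u[i] for u in users) / n for i in range(m)]
    norm_users = [[u[i] / means[i] for i in range(m)] for u in users]
    norm_ref = [reference[i] / means[i] for i in range(m)]
    # Absolute differences between each user and the reference sequence.
    deltas = [[abs(norm_ref[i] - row[i]) for i in range(m)] for row in norm_users]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    # Grey relational coefficient per indicator, combined as a weighted sum per user.
    return [
        sum(w * (d_min + rho * d_max) / (d + rho * d_max) for w, d in zip(weights, row))
        for row in deltas
    ]


# Hypothetical data: three users, two indicators, equal weights.
sample_users = [[8.0, 1.5], [5.0, 1.0], [1.0, 1.0]]
grades = grey_relational_grades(sample_users, reference=[10.0, 2.0], weights=[0.5, 0.5])
```

In this example the first user is closest to the reference on both indicators and therefore receives the highest grade. The sketch fails with a division by zero if an indicator averages to zero or if every user equals the reference.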

Data Crawling
In this work, a case study of lead user identification is conducted to verify the effectiveness and practicality of the proposed method. We test the method on samples collected from a CDC, Xiaomi Forum, which has over 50 million registered members, about 65 percent of whom are active [3]. Xiaomi is a mobile Internet company focused on designing and manufacturing smartphones; its CDC collects many valuable ideas and designs from community members [2]. Hence, it is an ideal source for the research. Figure 2 shows the user interface of Xiaomi Forum. We used a self-developed spider program to collect the comments of 9500 users (including topic posts and feedback) and their recent ID information (including contribution value, rank, and point) from 1 November 2018 to 20 May 2019. The R programming language and SPSS were used to perform the analysis.

Data Analysis and Results
Among the comments published by the 9500 users, the maximum number of occurrences of product-feature words in a user's comments is 1722, the maximum number of product-issue words is 207, and the maximum number of emotional words is 1079. The effective length of comments was processed with the R programming language, and the maximum quantized value is 0.697. For the timeliness of comments, the maximum per-user average is 1.7. The highest contribution value is 11,328, the highest rank is 8, and the highest point value is 213,176. We used these values as the reference sequence: {1722, 207, 1079, 0.697, 1.7, 11328, 8, 213176}.
Next, based on the data collected through the social survey, we used FAHP to calculate the weights of the indicators for lead user identification. We first constructed the judgment matrix shown in Table 4 and then used Formulas (2)-(4) to convert it into the fuzzy consistent matrix shown in Table 5. After the consistency verification of the matrix, the weight of each evaluation indicator was calculated by Formula (5).
Afterward, we used GRA to calculate the quantized value of each user's evaluation indicators, computed the degree of correlation between each user's sequence and the reference sequence, and ranked the users by that degree. Since von Hippel considered that only about three percent of customers are lead users [9], we took the top three percent of the 9500 users as the lead users.
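Forming the reference sequence and applying the three-percent cut can be sketched as follows. This is an illustrative sketch only: the helper names `reference_sequence` and `top_fraction` are hypothetical, the reference value of each indicator is assumed to be the column-wise maximum, and rounding the cutoff up to a whole user is an assumption not stated in the text.

```python
import math


def reference_sequence(users):
    """Take the best (largest) observed value of each indicator across all users."""
    return [max(column) for column in zip(*users)]


def top_fraction(user_ids, grades, fraction=0.03):
    """Return the IDs whose grades fall in the top `fraction` of the ranking."""
    cutoff = math.ceil(len(user_ids) * fraction)  # assumption: round up to a whole user
    ranked = sorted(zip(user_ids, grades), key=lambda pair: pair[1], reverse=True)
    return [uid for uid, _ in ranked[:cutoff]]
```

The `fraction` parameter corresponds to the relevance threshold, so a different share of lead users can be tried without changing the ranking itself.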
To verify the validity of the proposed method, we compared its recognition results with those of the manual method. Following the method suggested by Brem and Bilgram [13], we selected 500 Xiaomi Forum users who had used the community for more than one year and sent them a survey invitation through the mail system. A total of 272 respondents returned valid information. We asked them to select the 50 most representative community members based on their feelings and experience. The respondents rated these users in terms of expertise, experience, demand, and participation level (each weighted 1/4), and the top 20 users were taken as the lead users of the community. To ensure the effectiveness of the comparison, the online behavior of these 20 lead users was tracked for 60 days.
The posts collected during the tracking period were analyzed by seven scholars and technical experts from enterprises and colleges. They found that, apart from the five users ranked 1st, 4th, 10th, 13th, and 19th, the other 15 users had a high leading-edge status. Each of these 15 users posts at least ten topics per week, and most of the posts relate to the market, products, and technology. In contrast, although the five remaining users post many topics and are well known in the community, only a small share of their posts relate to technology, market, experience, or demand. They are therefore merely active users. As shown in Table 6, the top 10 lead users identified by our method are compared with the lead users

Key Findings
The results revealed that the proposed identification method has good precision and recall. The comparison also showed some differences between the two methods, which stem from the manual method's heavy dependence on manpower. First, it takes considerable time to find qualified respondents. Second, in the analysis stage, respondents are influenced by their own perception and cognitive judgment when selecting lead users. If a lead user rarely interacts with other users, their influence is limited, other users lack a basic understanding of them, and they will not be recognized. According to Table 6 and as far as we know, IDs 437500596 and 23957255 contributed many product experience and product evaluation posts in the community and have rich user experience and product knowledge. However, because these two users interacted relatively little with other users, they did not attract widespread attention and were overlooked by the manual method. The manual method therefore requires cumbersome data processing to reduce the subjectivity of the survey results. The whole process is complicated, and the quality of the survey is difficult to guarantee.
Compared with the manual method, our approach mainly uses the contributed content and statistical information left by community members to identify lead users, because the suggestions and comments that users contribute in a CDC can yield valuable and novel solutions [48]. Furthermore, progress in machine learning for natural language understanding, such as semantic word space models and semantic network analysis, has made it feasible to capture open text content on the Internet. To overcome the heavy dependence of traditional lead user identification methods (such as manual screening and the ordering algorithm) on manual recommendation, we constructed a criterion system for lead user identification in the field of innovative product design. It can be combined with the judgment mechanism of manual methods, automatically processes large-scale data with a machine learning algorithm, and has no strict requirements for respondents. Therefore, our method has advantages in efficiency and accuracy.
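For readers who wish to reproduce this kind of comparison, precision and recall between two sets of identified users can be computed as in the sketch below. It assumes that an expert-validated list is treated as the reference set. The function name `precision_recall` and the placeholder IDs are hypothetical and do not correspond to the users in Table 6.

```python
def precision_recall(predicted, actual):
    """Compare predicted lead users against a reference set of lead users."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # users found by both approaches
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Placeholder IDs for illustration only.
p, r = precision_recall(["u1", "u2", "u3", "u4"], ["u2", "u3", "u5"])
```

Here two of the four predicted users appear in the reference set, so precision is 0.5, and two of the three reference users are recovered, so recall is about 0.67.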

Theoretical and Practical Implications
With the application and development of the new generation of Internet and digital technologies, the interaction models and information-matching modes of economic actors are undergoing profound changes. The integrated development of crowdsourcing and innovative design has quietly upended the traditional industrial structure and reflects how enterprises are rethinking their innovation model. Crowdsourcing design allows enterprises to assign some or all of their design tasks, through online platforms, to organizations or individuals with the appropriate capabilities and resources and then collaborate with them to complete the work. It draws on the wisdom of platform users, sustainably improves the allocation of the knowledge and technical resources that emerge during crowdsourcing activities, shortens response time, and enables rapid matching of supply and demand information. It also helps enterprises create better products or services that meet market and consumer expectations. Therefore, crowdsourcing significantly improves the efficiency of resource allocation and labor productivity, reduces enterprises' development costs, and promotes collaborative value creation in crowdsourcing networks.
At present, numerous studies focus on the link between the environmental aspects of sustainability and crowdsourcing [49,50]. As a valuable tool, crowdsourcing can attract diverse stakeholders to generate novel ideas and develop them into sustainable solutions [51]. A previous study on how the environmental and economic dimensions of sustainability influence consumer responses found an interaction between consumer support for sustainability and enterprise sustainability [52]. Hence, crowdsourcing could favor improvements in environmental sustainability performance as well as in economic and social sustainability performance. It is worth noting that digitalization strengthens the connections among products, factories, value chains, and users so that the production cycle is as sustainable as possible. Thus, in the new development stage, the business, information, engineering, and analytics perspectives on digitalization are connected [53], which could promote the sustainable development of the digital crowdsourcing economy and support high-quality economic development.
As we mentioned above, this study contributes to research on product design by providing a new method that can automatically discover lead users in CDC. We developed a criterion system that includes the indicators of user contribution contents and online behavior to measure the leading-edge status of the CDC members. Then, an ensemble method that incorporates combination weighting and correlation analysis was constructed to analyze the indicators to search for the lead users. The experiment results confirmed that the proposed method could accurately and efficiently recognize lead users. Additionally, we also explored the characteristics of CDC members' contribution behavior. Since the entire crowdsourcing design process was implemented online, most of the participants' behaviors, such as post topics and feedback, were shown to the community managers. These behaviors can well reflect the individuals' creativity, which may assist enterprises in finding suitable candidates as partners in the NPD projects.
Our study also provides operators with a set of practical implications for the management of CDC and lead user search. First, our method does not need to implement large-scale social surveys, reducing the operators' manpower costs in recognizing lead users. The potential groups of lead users can be automatically identified by the method, and then enterprises may choose suitable collaborators from these groups according to product development needs. Second, the CDC managers can provide the identified lead users with incentives to enhance their loyalty, which may ensure the stability of the NPD projects.
Therefore, our study has significant theoretical value and practical significance. First, inviting users to participate in the design process is a necessity for the R&D of innovative products. In the field of open innovation, the mobile Internet and online communities have enabled a crowdsourcing model with multi-role, large-scale social-ecological interaction. Crowdsourcing encourages users to join the development of innovative products and services, crosses disciplinary barriers, and lets enterprises obtain users' needs quickly. The participation of more users enables enterprises to gather richer user needs, which are the driving factors of new product design. The user contribution perspective can guide enterprises to meet design needs and pursue sustainable development continuously.
Moreover, in the era of a knowledge network economy with a developed Internet, innovation that depends only on an enterprise's internal resources cannot keep up with growing market demand. As a critical external knowledge resource, lead users master key technologies of innovative product design, hold a large amount of open external knowledge, and anticipate future demand, which allows them to produce innovative results. Crowdsourcing provides enterprises with new opportunities to integrate lead users. Therefore, recruiting lead users can enable enterprises to obtain extensive and objective innovative ideas, generate more creative cases, and improve the innovation quality of new products. Furthermore, lead users can help enterprises achieve sustainable and open innovation.
In addition, the lead user identification method is also applicable, to a certain degree, to English-language crowdsourcing platforms. At the operational level, the method can be extended to an English environment, but attention must be paid to the characteristics of that environment. Because English and Chinese differ significantly, the evaluation method and judgment mechanism need to be adjusted to the context. Moreover, differences in grammatical structure between the two languages lead to differences in their thesauruses, so a corresponding thesaurus must be built for text content analysis. After this optimization work is completed, the research method proposed in this article can be generalized.

Conclusions and Limitations
Lead users are the most valuable customer groups in the NPD. Therefore, accurately identifying and locating lead users is significant for enterprises to effectively organize and manage sustainable open design activities. This study proposes a lead user identification method based on user behavior data and contribution content analysis and constructs a criterion system to evaluate the user's leading-edge status. Our approach has several advantages, such as high efficiency, accuracy, and coverage compared with the manual method. The effectiveness of the proposed method in this work is verified by comparative analysis.
This study has some limitations. The method identifies lead users by sorting the correlations of the community members: the greater the degree of relevance, the higher the user's leading-edge status. However, how to select the relevance threshold that separates lead users from regular users is not yet clear. Based on Liao's research results [54], we consider that operators might determine the threshold by two methods:
1. The total amount (S) of valid information in the content contributed by the user can be used as the threshold value. When a user's valid contributed content exceeds S, she/he can be regarded as a lead user. The selection of the S value depends on the improvement rate of the product proposed by the enterprise.
2. According to the ranking of correlations, the top F percent of the candidate users are lead users. The selection of the F value depends on factors such as the size of the research sample, the extent to which the company intends to improve the product, and the number of potential customers who may purchase the improved product.
Future research may verify these two threshold determination methods and discuss the context in which each method applies.