\title{Technical Content Writing Evaluation Test}
\author{ }


This test is to evaluate the candidate’s technical content understanding, conceptualizing and writing skills. Two paragraphs are given below related to network data representation. It is required to write a single summary for which links the intent of both paragraphs in own 700-800 words in the form at the bottom of this webpage. Kindly avoid plagiarism too.

\section{Paragraph 1}
Networks can be widely seen in real world, which are one of the most complex data, e.g., transportation networks, social networks and biological networks [10,16,36]. Networks can be modeled as graphs which are composed of nodes and edges. For instance, the individuals or organizations can be represented as nodes in social networks, while the interactions are depicted by the edges. The analyses on the network structures are beneficial for the exploration and understanding of networks.

In networks, one of the most remarkable characteristics is the community structure, such as the overlapping community structure [25] and the hierarchical community structure [19]. A community is a group of nodes which are tightly connected with each other but rarely connected with the outside nodes. Community detection is a significant task for many applications, such as web page clustering [9], musical rhythmic pattern extraction [6] and protein function prediction [30]. Besides, some assessments of a partitioned community structure are researched [17,18,23].

Figure 1
Figure 1: Flowchart of the nonnegative matrix factorization based community detection algorithm. The topology information in the network can be encoded
as an adjacency matrix. By decomposing the adjacency matrix into two nonnegative matrices, a low-dimensional representation can be achieved. Finally,
nodes can be classified into different communities based on the new representation.

In the past few years, numerous community detection methods have been proposed. Among them, modularity-based algorithms are particularly popular, e.g., fast greedy modularity optimization [5], extremal optimization [7] and spectral optimization [24,44]. Another category is divisive algorithm, e.g., GN algorithm [10], which removes the intracommunity edges to obtain clusters. Many dynamic algorithms also proposed with the merits of low computational complexity, e.g., label propagation (LP) algorithm [27], Potts model. Besides, some overlapping community detection algorithms were proposed [17], such as seed expansion approach [37] and fuzzy clustering [32]. However, most of these methods only utilize the
topology information. When it comes to complex and vague structure with many intracommunity links, the performance of these methods will be greatly affected. In order to address this problem, some research works try to make use of prior information to improve the detection performance in a semi-supervised manner. Cheng et al. [4] presented a semi-supervised method based on active learning to generate high-quality must-link and cannot-link constraints. Eaton et al. [8] proposed a semi-supervised algorithm, which is based on the spin-glass model. Liu et al. [20] combined the idea of graph regularization with pairwise constraint and proposed a semi-supervised nonnegative matrix factorization model. However, the real-world scenarios usually lack the desired prior information and it would consume much manpower to access the prior information. In addition, the performance of semi-supervised algorithms heavily depend on the quality of prior information, which might not be available in practice. Hence, given that semi-supervised methods are case-dependent and the prior information is scarce, the unsupervised methods are more desirable.

Without the prior information and human intervention, the unsupervised algorithms are self-organizing. More recently, one of the unsupervised algorithms, nonnegative matrix factorization (NMF) [26] based community detection, has attracted huge attention. Different from the previously described methods, NMF based method is a fuzzy community detection method where each node has a soft membership to each community. Inspired by the generative process of networks, NMF based method supposes that one network can be divided into a number of low-dimensional subspaces. The coefficient vectors in the new space are the soft membership vectors, which decide relationships between all pairs of nodes and communities. Fig. 1 shows the flowchart of a typical NMF based community detection method. Recently, theoretical advances on NMF enable some variants of NMF to be applied to community detection, such as symmetric NMF (SNMF) [34] and nonnegative matrix tri-factorization [43]. However, these methods only aim at minimizing the error between the factorized matrices and the original matrix, which ignore the underlying information in the network. Nodes within the same community may have totally different low-dimensional representations as shown in Fig. 2(b). Nodes 28 and 30 are two connected nodes, which belong to the same community. However, for nodes 28 and 30, the representations obtained from NMF based algorithm are totally different, which results in the misclassification of node 30. In order to solve the above problem, He et al. [13] proposed an extension of SNMF named graph regularized SNMF (GSNMF) method, where graph regularizer could smooth the variation in new space between two nodes with large similarity [3,12,45]. However, one community is a cluster of nodes, and the relationships among nodes in one community should not be simply pairwise, while GSNMF only considers the pairwise relationship of nodes. Essentially, modeling the higher-order relationship among nodes will significantly improve the performance of NMF [42].

Inspired from the theory of hypergraph learning [46], namely, a centroid within a hyperedge connected with more than two nodes by the hyperedge, we encode the higher-order information into NMF by hypergraph, and propose a framework named mixed hypergraph regularized NMF (MHGNMF) in this paper. The proposed MHGNMF contains two community detection methods, which are based on two kinds of cost functions for NMF, respectively. It is worth noting that the proposed framework can be extended to many existing variants of NMF. Our key insight of MHGNMF is that the new representation of nodes, which are within the same hyperedge, should be highly correlated. Since hypergraph construction is crucial to the proposed framework, we design a mixed hypergraph in this paper. Specifically, two sets of hyperedges are firstly generated by two approaches, i.e., topological connection and structural similarity, and then two hyperedges are mixed together for each centroid. Consequently, MHGNMF, which takes higher-order relationship into consideration and simultaneously takes advantages of topological connection and structural similarity, is able to learn a more discriminative representation. As shown in Fig. 2(c), contrast to the misclassification by NMF, the representations of nodes 28 and 30 obtained by MHGNMF are similar, which results in correctly categorizing. We conduct experiments on two artificial and six real-world datasets to evaluate the effectiveness of the proposed framework, where quantitative comparisons with state-of-the-art methods are provided to demonstrate the superiority of proposed framework.

Detection comparison for Dolphins network. (a) The ground truth of Dolphins, (b) the result of NMF based method, (c) the result of MHGNMF based method
Figure 2: Detection comparison for Dolphins network. (a) The ground truth of Dolphins, (b) the result of NMF based method, (c) the result of MHGNMF based method

We also have noted that Zeng et al. [42] incorporated hypergraph regularization term into NMF for image clustering problem. However, this work only adopts one cost function, and constructs the hypergraph by the p nearest neighbors graph, where the generation of hyperedges is based on the distance of raw features. Since the complex networks and images are different datasets, the hyperedges generation strategy in [42] cannot be directly applied in community detection. Therefore, aiming at community detection problem, we reconsider another way to introduce higher-order information into NMF in this paper, i.e. by mixing topological information and structural similarity.

\section{Paragraph 2}

Representing data as a graph or network app ears in numerous application domains, including, for example, social network analysis, biological systems, the Web, and any discipline that focuses on modeling interactions between entities [5, 25, 49]. The simple model of nodes and edges provides a powerful and flxible abstraction, and over time, more expressive models have been developed to incorporate richer structure in data. In one direction, models now use more information about the nodes and edges: multilayer networks capture nodes and edges of diffrent types [36, 48], metapaths formalize heterogeneous relational structure [24, 60], and graph convolutional networks use node features for prediction tasks [35]. In another direction, group, higher-order, or multi-way interactions between several nodes — as opposed to pairwise interactions — are paramount to the model. In this space, interaction data is modeled with hypergraphs [9, 66, 67], tensors [1, 7, 53], affiation networks [40], simplicial complexes [10, 51, 54, 56], and motif representations [11, 55]. Designing methods that effectively analyze the richer structure encoded by these expressive models is an ongoing challenge in graph mining and machine learning.

In this work, we focus on the fundamental problem of clustering, where the general idea is to group nodes based on some similarity score. While graph clustering methods have a long history [26, 43, 47, 57], existing approaches for rich graph data do not naturally handle networks with categorical edge labels. In these settings, a categorical edge label encodes a type of discrete similarity score — two nodes connected by an edge with category label c are similar with respect to c. This structure arises in a variety of settings: brain regions are connected by diffrent types of connectivity patterns [20]; edges in coauthorship networks are categorized by publication venues, and copurchasing data can contain information about the type of shopping trip. In the examples of coauthorship and copurchasing, the interactions are also higher-order — publications can involve multiple authors and purchases can be made up of several items. Thus, we would like a scalable approach to clustering nodes using a similarity score based on categorical edge labels that work well for higher-order interactions.

Here, we solve this problem with a novel clustering framework for edge-labeled graphs. Given a network with k edge labels (categories or colors), we create k clusters of nodes, each corresponding to one of the labels. As an objective function for cluster quality, we seek to simultaneously minimize two quantities: (i) the number of edges that cross cluster boundaries, and (ii) the number of intra-cluster “mistakes”, where an edge of one category is placed inside the cluster corresponding to another category. This approach results in a clustering of nodes that respects both the coloring induced by the edge labels and the topology of the original network. We develop this computational framework in a way that seamlessly generalizes to the case of hypergraphs to model higher-order interactions, where hyperedges have categorical labels.

The style of our objective function is related to correlation clustering in signed networks [8], as well as its generalization for discrete labels (colors), chromatic correlation clustering [12, 13], which are based on similar notions of mistake minimization. However, a key diffrence is that our objective function does not penalize placing nodes not connected by an edge in the same cluster. This modeling diffrence provides serious advantages in terms of traceability, scalability, and the ability to generalize to higher-order interactions.

We fist study the case of edge-labeled (edge-colored) graphs with only two categories. We develop an algorithm that optimizes our Categorical Edge Clustering objective function in polynomial time by reducing the problem to a minimum s-t graph cut problem on a related network. We then generalize this construction to facilitate quickly fiding the optimal solution exactly for hypergraphs. This is remarkable on two fronts. First, typical clustering objectives such as minimum bisection, ratio cut, normalized cut, and modularity are NP-hard to optimize even in the case of two clusters [17, 62]. And in correlation clustering, having two edge types is also NPhard [8]. In contrast, our setup admits a simple algorithm based on minimum s-t cuts. Second, our approach seamlessly generalizes to hypergraphs. Importantly, we do not approximate hyperedge cuts with weighted graph cuts, which is a standard heuristic approach in hyp ergraph clustering [2, 45, 67]. Instead, our objective exactly models the number of hyperedges that cross cluster boundaries and the number of intra-cluster “mistake” hyperedges.

With more than two categories, we show that minimizing our objective is NP-hard, and we proceed to construct several approximation algorithms. The fist set of algorithms are based on practical linear programming relaxations, achieving an approximation ratio of min \(2-\frac{1}{k}, 2-\frac{1}{r+1}\), where k is the number of categories and r is the maximum hyperedge size (r = 2 for the graph case). The second approach uses a reduction to multiway cut, where practical algorithms have a \(\frac{r+1}{2}\) approximation ratio and algorithms of theoretical interest have a \(2-\frac{1}{k}\) approximation ratio