MBA智库文档行业基于归并排序法的铁路OD矩阵形成方法研究.pdf

基于归并排序法的铁路OD矩阵形成方法研究.pdf

下载

Gulogulo_

12页 | 626KB | 0次下载 |

0.0

(0人评价)

我要评价：

投诉举报

用手机看文档

扫一扫,手机看文档

下载

开通VIP

- 1 - 中国科技论文在线 Deriving OD matrix from Railway ticket records based on merge sort method# LIN Ruixi 1 , Wu Jianping 2 , Wang Jiaxi 2 , LIN Boliang 2** 5 (1. Department of Electronics and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, 999077; 2. School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China) Foundations: National Natural Science Foundation of China（No. 51178031); project of China Railways Corporation( 2013X005-A） Brief author introduction:LIN Ruixi(1993-),Female,Undergraduate student,Information Iechnology,IC Design Correspondance author: LIN Boliang(1961-),Male,Professor,Railway Operation Management,Transportation Systems Network Design,Transportation and Logistics,Intelligent Transportation System. E Abstract: In a transportation system, OD pairs are usually accumulated by large quantities of primary records. The records may come from train tickets, freight invoices, and flight trip tickets etc. In 2013, 10 billion passengers are transported by China railway. Each train ticket contains information of origin and destination stations, train number, and train class etc. Only several fields of information, such as origin and destination, are required according to specific research purposes. The rest of the fields are unnecessary and need to be merged. A direct merging method in which each record compares with all existing OD pairs will consume a lot of time when there are large quantities of records. This 15 work will propose a sorting-based merging method, which applies to general OD matrices. The time complexity is scale of O(NlogN) . For tests, 30 samples are used for numerical analyses. The results show that the efficiency is far better than that of the direct merging method. Key words: OD matrix, Railway ticket, merging, sorting. 20 0 Introduction Origin-destination (OD) pairs are fundamental data in the field of transportation, and are widely used in various applications. For example, predictive OD matrices of the long-term future are the basis of optimizing transportation network design, while historical OD matrices usually apply to design optimal transportation plans. The size of an OD matrix is always huge in a real 25 transportation system. Each element of the matrix, . an OD pair, is accumulated by numerous passenger tickets or freight invoices. For a typical transportation system, one of the fundamental problems is to efficiently integrate the records into a simple OD matrix. At present, there are over 100 thousands kilometers and thousands of stations in China railway systems. The length of high-speed railway is over 11,000 kilometers. There have been 30 4894 passenger trains including 2660 high-speed trains(EMU) in China railway system since July 2014, and according to statistics, in 2013, billion passengers travelled by China railway [1] . The average daily passenger traffic volume reaches million. Each freight station is associated with about 104 OD pairs, and most passenger stations are associated with 95 OD pairs on average [1] . It is clear that the OD matrix of China rail network is quite sparse. Based on this 35 characteristic, the author of [1] proposed an origin-associated merging method to integrate the primary ticket records effectively. However, the algorithm results in low efficiency where the OD matrix does not show a sparse characteristic. To deal with more general cases, this work proposes a merging method based on a sorting algorithm. The proposed method sorts the primary ticket records by origins, then sorts by destinations, and finally merges the sorted tickets. The efficiency 40 of this method mainly depends on the sorting algorithm. Typical sort algorithms include bubble sort, selection sort, insert sort, shell sort, quick sort, merge sort and heap sort [2~8] . Bubble sort and selection sort are stable but only works well for - 2 - 中国科技论文在线 short sequences of records. Compared to merge sort, the time complexity of insert sort is higher, while shell, quick, and heap sort are unstable. The sequences of ticket records in this work are 45 large, hence we adopts the comparatively stable and efficient merge sort to preprocess our data. The rest of the paper is organized as follows: Section 1 describes two existing merging methods. In Section 2, we give a simple illustration on the procedures of merge sort. A detailed description of the merging method based on sort is given in section 3. Computational results and conclusions are presented in section 4 and 5. 50 1 Direct merging and origin-associated merging method Description of direct merging algorithm Let initS be a set of N ticket records. Without loss of generality, we assume that each record only consists of origin and destination stations. Let mergeS stores the OD pairs after merging the records. Move the first record in initS to mergeS . During each comparison, compare 55 the origin and destination of a record in initS with the origins and destinations of all records in mergeS . If any two records contain the same origin and same destination, merge the record in initS into the existing record in mergeS ; otherwise add that record into mergeS . Repeat the above step until all records in initS are compared. Assume there are M origin-destination pairs after merging N ticket records. The complexity of this algorithm is 4/)1( MN  [1]. 60 Origin-associated merging method Considering that the real railway OD matrix is highly sparse, the author of [1] proposes a novel merging approach named origin-associated merging method. During each comparison，first compare the origins of two records, if their origins are same, compare their associated destinations. In this way the number of comparison times can be effectively reduced. The complexity of the 65 algorithm is  4/)1(4/)1( DYN  , where Y is number of all origins of records and D is the number of destinations that each origin is associated with. 2 The Merge Sort Algorithm The merge sort (MS) algorithm is an efficient sorting algorithm built on the basis of merging operations. This algorithm exemplifies the divide-and-conquer algorithms. Suppose the original 70 sequence contains N records, denoted by )(NS . We can divide this sequence into two subsequences, and again divide each subsequence into two subsequences. Repeat dividing until )(NS is divided into N subsequences. Eventually, each subsequence will contain one element. Then merge every two subsequences until the whole sequence is sorted. The nature of this algorithm is dichotomy. Besides dichotomy, trichotomy or even K dividing method can be 75 utilized. In general cases, dichotomy works well. Considering that the complexity of segmentation methods will grow rapidly as the number of segments increase, this work adopts the principal of dichotomy in the proposed method. Using dichotomy [9] , we segment )(NS in the following manner: - 3 - 中国科技论文在线 80 Fig. 1 The operation of merge sort on the array )(NS After division, it will become a tree with N elements in each layer. For two sorted sequences )(nA and )(mB , the time consumed in merge-sorting them is nm , which is in the complexity scale of )(nO . Therefore, the time used for each layer is in the scale of )(nO . In addition, the depth of the tree is Nlog . Multiplying the time for each layer with the depth, we 85 can obtain the total time spent on the sort process is )log()(*log)( NNONONNT  . The detailed MS algorithm is described by the following steps: Step 1: Divide the sequence of length N into two subsequences of length 2/N . Step 2: Merge sort the two subsequences respectively. Step 3: Combine the two sorted subsequences into a single sorted sequence. 90 It is not hard to find that the core of this algorithm is the merging process, which is a recursive process. 3 The sort merge method Description of the sort merge method Let initS be a set of N ticket records. Define B as a temporary set which stores the 95 records with the same origin. And let C stores the OD pairs after merging the records.. Firstly, empty the set initS , B and C . Add all records into initS and sort them according to the origin by MS algorithm. Secondly, move the first record of initS into B . For the kth record init k Sr  , if the origin are same with the first record in set B , add this record into B . Otherwise, let m be the number of all elements in set C . For each element Ch j  , let jn 100 denote the ticket number , move the first record of B into C . Let 1 mm and 1mn . For the ith record in set B , if the destination of this record is the same with the mth record of C , 1 mm nn . Otherwise, move the ith record into C , and let 1 mm and 1mn . Repeat this operation until all elements in B are compared. The detailed algorithm is illustrated in the flow chart. 105 S(N) S[1:N/2] S[1:N1/2] S[N1/2+1:N2] S[1:N2/2] S[N2/2+1:N3] S[1:N2/2] S[N2/2+1:N3] S[N/2+1:N] S[1:N1/2] S[N1/2+1:N2] S[1:N2/2] S[N2/2+1:N3] S[1:N2/2] S[N2/2+1:N3] …… S[1:1] S[1:1] … N1=N/2 N2=N/4 N3=N/8 Nj=N/2 j Nk=1,k= log2N - 4 - 中国科技论文在线 Flow chart of sort merge method In figure 2, the two blue diamonds with MS notation represent two subprograms. The square in pink denoted by Merge the record in set B and move the result to set C is another subprogram 110 shown in figure 3. Y Y N N N Sort N records of S init M S Define set B and move inti1 Sr  into B . Let 1k k=k+1 ? Nk  M S Add all N ticket records into set intiS Move kr into B Y Are the origin of kr and 1kr the same ? END START Sort all records of B according to the destinations. Merge the record in set B and move the result to set C ? Nk  Empty B - 5 - 中国科技论文在线 Subprogram of sorting by destinations and merging Complexity Analysis 115 After we have sorted the all N records by origins, assume there are m origins whose names are independent, with each origin associated with mNNN ,,, 21  records. The number of all records can be expressed as    m i iNN 1 . To sort the iN records of the ith origin by destinations, the computational time will be ii NN log . The time used to sort the m sequences 1 mm Move Bri  into C , incorporate the ticket number field into ir , denote the new record as mh and let 1mn . Y Are the destination of ir and 1ir the same ? Let 1i . Let m be the number of all elements in set C . For each element Ch j  , let jn denote the ticket number of jh . START N Y END 1 mm nn ; 1 ii N Bi  ? 2B Y N - 6 - 中国科技论文在线 by destinations will be: 120 mm NNNNNN logloglog 2211   (1) Set  mNNNI ,,,max 21max  , then we can obtain max2211 loglogloglog INNNNNNN mm   (2) In this way, the number of comparisons in merging is only 1N when the sequence has been sorted by origins and by destinations. Since NI max , the sort merging method leads to a time 125 complexity of )log( NNO . 4 Algorithm testing and computational results Complexity Analysis For the convenience of comparison, we also use the samples from [1] to measure the efficiency of the proposed method in this work. The author of [1] uses 30 samples of ticket records 130 for computation, all of which are corresponding with sparse OD matrices. The computation results of our method are shown in table 1. Tab. 1. Computational results for sparse OD matrices instances No. N M  1T (sec) 2T (sec) 3T (sec) 12 /TT 13 /TT 23 /TT 1 200000 126242 98 1 1 % % % 2 400000 172681 296 4 2 % % % 3 600000 189789 524 6 3 % % % 4 800000 196185 673 8 4 % % % 5 1000000 198557 872 10 5 % % % 6 1200000 199385 1054 11 6 % % % 7 1400000 199697 1239 14 7 % % % 8 1600000 199835 1449 15 8 % % % 9 1800000 199871 1639 18 9 % % % 10 2000000 199897 1843 20 10 % % % 11 2200000 199912 2057 22 10 % % % 12 2400000 199914 2303 24 12 % % % 13 2600000 199915 2510 26 13 % % % 14 2800000 199919 2671 29 15 % % % 15 3000000 199914 2926 32 16 % % % 16 3200000 199915 3082 35 17 % % % 17 3400000 199910 3336 37 18 % % % 18 3600000 199911 3540 39 19 % % % 19 3800000 199912 3715 39 20 % % % - 7 - 中国科技论文在线 20 4000000 199923 3864 42 21 % % % 21 4200000 199913 3987 42 23 % % % 22 4400000 199915 4222 44 23 % % % 23 4600000 199922 4410 46 24 % % % 24 4800000 199924 4577 47 25 % % % 25 5000000 199929 4820 49 27 % % % 26 5200000 199923 4977 51 28 % % % 27 5400000 199925 5251 52 29 % % % 28 5600000 199928 5622 55 30 % % % 29 5800000 199921 5824 55 31 % % % 30 6000000 199916 5906 59 32 % % % In table 1, the first column represents the sample indices, the second column N is the number 135 of tickets in each sample. The step size of the sample number is 200,000 items. The 30 th sample contains 6 million tickets. The third column M is the number of OD pairs after merging. The forth column is the reduction ratio, . NM / . 1T ， 2T 和 3T represents the computation time of the direct merging method, origin-associated merging and sort merging method. The last three columns are the ratios of the computational times between them. The last column shows that 140 the sort merging method and the origin-associated merging method works competitively, the merging time of sort merging is about half of that of origin-associated merging, . 23 TT . The time curves of the two methods are illustrated in figure 4. Computational time of two merging methods for sparse OD matrices 145 Complexity Analysis To test the performance in non-sparse matrices, we assume again that there are 2000 stations in the transportation network, denoted by A0001～A2000. The origins of the ticket records are - 8 - 中国科技论文在线 randomly generated from A0001～A2000. Contrast to [1], the destinations range from 2000 stations labelled by A0001～A2000 instead of only 100 stations. Records with identical origin and 150 destination will be excluded. The programs run on the same personal computer with Intel(R) Core(TM)i5 processor as in [1]. The computational results are shown in table 2. Tab. 2. Computational results for non-sparse OD matrices instances No. N M  2T (sec) 3T (sec) 23 /TT 1 200000 195017 3 1 % 2 400000 380545 4 3 % 3 600000 557472 10 4 % 4 800000 725347 12 5 % 5 1000000 884467 18 7 % 6 1200000 1036824 22 7 % 7 1400000 1181961 26 8 % 8 1600000 1318598 33 10 % 9 1800000 1449510 40 10 % 10 2000000 1573798 45 11 % 11 2200000 1691447 53 13 % 12 2400000 1804570 63 14 % 13 2600000 1911197 71 15 % 14 2800000 2012880 78 15 % 15 3000000 2110633 88 18 % 16 3200000 2202549 92 17 % 17 3400000 2290739 102 18 % 18 3600000 2373181 112 21 % 19 3800000 2451604 117 21 % 20 4000000 2528324 128 22 % 21 4200000 2599760 130 25 % 22 4400000 2668570 136 26 % 23 4600000 2733313 143 27 % 24 4800000 2793936 148 27 % 25 5000000 2853626 158 28 % 26 5200000 2909563 168 30 % 27 5400000 2962193 177 32 % 28 5600000 3012856 187 32 % 29 5800000 3060224 193 33 % - 9 - 中国科技论文在线 30 6000000 3106888 202 34 % Theoretically, for a transportation network of 2000 stations, the possible number of OD pairs is 2000×1999 (around 4 million), while the randomly generated 6 million records lead to about 3 155 million merged OD pairs. As the sample size enlarges, the number of OD pairs approaches 4 million. Table 2 shows that the reduction ratio can reach 52% for non-sparse OD matrices, given that any two origin and destination pair has equal probable transportation demands. The curve in figure 5 is in line with this asymptotic tendency. 160 Reduction ratio for non-sparse OD matrices instances In addition, table 2 also indicates that the sort merging method have higher performances than the other method for non-sparse OD matrices. Under the situation of 6 million samples, the time using sort merging is only around 1/6 of that of the origin-associated merging. The time curves of the two methods are shown in figure 6. 165 - 10 - 中国科技论文在线 Computational time of two merging methods for non-sparse OD matrices From table 1 and table 2, it can be concluded that sort merging is insensitive to the sparseness of an OD matrix. This feature can be easily found in figure 7. 170 Fig. 7 The time characteristics of sort merging for sparse and non-sparse OD matrices The pink curve in figure 7 represents the time performance for non-sparse OD matrices, and the blue curve represents that of the sparse OD matrices. Both curves are approximately linear with a slope of . The pink curve almost coincide with the blue one, which indicates that the performance of sort merging is not likely to be affected by the sparseness of the samples. In 175 contrast to sort merging is the origin-associated merging, whose performance is highly sensitive to the sparseness of the samples. This feature of origin-associated merging can be seen from figure 8. - 11 - 中国科技论文在线 Fig. 8 The time characteristics of origin-associated merging for sparse and non-sparse OD matrices Figure 8 displays time characteristics of origin-associated merging under both situations of 180 sparse and non-sparse OD matrices samples. The pink and blue curve denote the sparse and non-sparse cases respectively. It is clearly seen that the origin-associated merging efficiency drops as the number of records in non-sparse matrices samples grows. For the last sample with as many as 6 million records, the computational time is 59 seconds for sparse OD matrices samples (see 2T in table 1), while it costs 202 seconds for non-sparse OD matrices samples (see 2T in table 185 2). The consumed time grows more than three times. 5 Conclusion This work proposes a sort merging method for OD data integration, which sorts all primary ticket records by origins, then sort all subsets of records with the same origin by destinations, finally merge the records with the same origin and same destination. The merging efficiency 190 improves significantly using this method. It achieves the same effiecny as origin-associated merging when dealing with sparse OD matrices, and outperforms the latter method greatly when the OD matrices are not sparse enough. Acknowledgements (Optional) The authors gratefully thank PhD students Xuhong Wen, Lei Chen and Jian Li Beijing Jiao 195 Tong University, for advice on the refinements and embellishments of figures in this work, and suggestions on the tests and computational analyses respectively. References [1] LIN Ruixi，LIN Boliang. An approach to merging railway ticket records into origin-destination pairs based on sparse matrix characteristic [OL]. [2014-08-25]. China Sciencepaper Online. 200 [2] Oyelaml Olufemi Moses. Improving the performance of bubble sort using a modified diminishing increment sorting [J]. Scientific Research and , 4(8):740-744. [3] Iraj H., Afsari M. H. S., Hassanzadeh S. A New External Sorting Algorithm with Selecting the Record List Location[J]. USEAS Transactions on Communications. 2006, 5(5):909-913. 205 [4] Srirupesh,Ttonhunt,Sweetesh. Bidirectional Expansion - Insertion Algorithm for Sorting[C]. Second International Conference on Emerging Trends in Engineering and Technology, 2009, 59-62. - 12 - 中国科技论文在线 [5] , . Stochastic analysis of shell sort[J].,31:442-457. [6] Wang Xiang. Analysis of the time complexity of quick sort algorithm. Information Management, Innovation Management and Industrial Engineering (ICIII), 2011 ,408-410. 210 [7] Jafarlou M. Z., Fard P. Y. Heuristic and Pattern Based Merge Sort. Procedia Computer Science. 2011, 3: 322-324. [8] Xu Xiu. Distributed computing systems using heap sort of multi-layer sub-tree by task match scheduling algorithm. 3rd International Symposium on Parallel Architectures, Algorithms and ,321-325. [9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to Algorithms (3rd 215 edition) . The MIT Press, 2009. 基于归并排序法的铁路OD矩阵形成方法研220 究林睿熙1，武建平2，王家喜2，林柏梁2 （1. 香港科技大学,电子与计算机工程系,香港,999077； 2. 北京交通大学，交通运输学院，北京，100044）摘要：在一个运输系统中，OD 对的形成通常是通过大量的原始记录累积而成，原始记录可225 能是火车票、货运单据、飞机票等等。例如，2013年中国铁路系统发送了亿人次的旅客。每张火车票包含了发站、到站、车次、席别等字段信息。而不同的研究目标，需要关注的字段是不同的。如果我们的某项研究只需要关注其中的发到站，则对其余字段就需要进行合并处理。把每张票和已有的 OD对进行发到站逐一比对的直接归并法将耗费大量的计算时间。本文提出了一种基于排序法的 OD 矩阵形成技术，其时间复杂度为 O(NlogN)。作为实230 验，我们计算了 30 个样本进行数据分析，结果显示本文所提方法比直接合并法的效率有明显的提高。关键词：OD 矩阵，铁路运输票据，归并，排序. 中图分类号：TP274 235

联系我们

智库文档公众号

客服微信

基于归并排序法的铁路OD矩阵形成方法研究.pdf

下载

标签

相关专题更多

联系我们

意见反馈

标签

相关专题 更多

联系我们

意见反馈

相关专题更多