智库文档所有分类

基于稀疏矩阵特点的铁路原始客货流归并方法研究.pdf

下载

Jsjacks

11页 | 488KB | 0次下载 |

0.0

(0人评价)

我要评价：

投诉举报

用手机看文档

扫一扫,手机看文档

下载

开通VIP

- 1 - 中国科技论文在线 An approach to merging railway ticket records into origin-destination pairs based on sparse matrix characteristic 5 LIN Ruixi 1 , LIN Boliang 2** (1. Department of Electronics and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, 999077; 2. School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China) Brief author introduction:LIN Ruixi(1993-),Female,Undergraduate student,Information Iechnology,IC Design Correspondance author: LIN Boliang(1961-),Male,Professor,Railway Operation Management,Transportation Systems Network Design,Transportation and Logistics,Intelligent Transportation System. E-mail: Abstract: This work describes a data merging method under the background of railway tickets data 10 integration. Origin-destination (OD) pairs are fundamental data in transportation engineering, and are widely used in various applications. For example, predictive OD pairs of the long-term future are the basis of optimizing transportation network design. Optimization of transportation plans also relies on existing OD pairs. The size of an OD matrix is always huge in a real transportation system. Each element of the matrix, . an OD pair, is accumulated by numerous passenger tickets or freight invoices. 15 The size of an OD matrix is huge in real transportation systems. In a railway system, each element of the matrix, . an OD pair, is accumulated by numerous passenger ticket or freight invoices. In 2013, China railway system transported billion passengers. In other words, an equivalent number of tickets were sold. Each ticket consists of information on the name of origin and destination, train number, and seat class etc. Similarly, a railway freight invoice consists of information on the name of 20 origin and destination, freight category (. coal, oil, grain, and ore), and volume etc. Obviously, the tickets usually contain some identical fields of information. For example, two tickets can share the same origin and destination. Researchers use only particular fields of the tickets for different analysis purposes. Consequently, an effective approach is needed to merge the ticket records based on interested keywords and creates the corresponding compressed OD matrix. A simple direct merging method costs 25 a lot of computations. This work proposes an effective and efficient approach, named origin-associated merging, to merge ticket records based on the sparse matrix characteristic. For experiments, 30 samples are created according to the characteristic of China railway passenger flow, ranging from 200 thousands to 6 million records with a step of 200 thousands records. The experimental results show that the time using origin-associated merging is about 1% of that of direct merging. It is worth mentioning 30 that the number of passenger tickets in China railway system is approximately 6 million per day. Key words: data integration; rail transportation; OD pair; sparse matrix; ticket records 0 Introduction Data merging is an important method in data processing. With the progress of information 35 technology, the scale of data has been rapidly expanded, especially in areas involved in computer applications. The work of Ying [1] proposes a method to merge and integrate catalog data in merged university libraries. Chen [2] gives a preliminary study on patterns of library network construction and data integration since university merging of China. In the paper of Zhang [3] , several issues are discussed including the integration of bibliographic data and merging of data to 40 export, splitting bar code number, unifying the length of bar code number, deleting duplicate records and merging data. However, works on the merging of OD data in railway passenger tickets or freight invoices has not been found explicitly. An optimization of passenger or freight train plan depends on the analyses of current or annual prediction of OD data [4][5][6] . While the optimization of railway network design usually 45 depends on the prediction of OD data ten years later [7][8] . At present, China railway operation mileage is over 100 thousands kilometers, among which the length of high-speed railway is over 11,000 kilometers，covering thousands of stations. There have been 4894 passenger trains - 2 - 中国科技论文在线 including 2660 high-speed trains(EMU) in China railway system since July 2014 [9] . According to statistics, in the past year, billion passengers travelled by China railway [10] . It should be 50 noted that billion tickets does not impose the same amount of OD pairs. In general, an OD pair consists of a great number of passengers. To merge billion tickets into the needed OD pairs, one ticket is compared to another by their origin station names and destination station names respectively. An estimated comparison times will exceed 100 trillion, which is hard to realize on computers. For a large-scale railway network such as China railway network, only a few pairs of 55 stations present passenger demands. In other words, for a given station, passengers travel to only a few stations. For instance, assuming 2000 stations in the railway network, at a given station A, the passengers can travel to all the other 1999 stations in theory. In reality, they actually travel to only a small number of stations among the 1999 stations. From the matrix point of view, the railway OD matrix is quite sparse. Based on this sparse characteristic, this paper proposes an 60 origin-associated OD records mergence approach. The rest of the paper is organized as follows: Section describes the direct merging method. In Section we provide a complexity analysis of direct merging. In Section we illustrate the origin-associated merging method, with a complexity analysis in section . Numerical analyses utilizing both methods are described in section 3. 65 1 Direct merging method A typical train ticket contains information of the name of origin and destination, train number (which implicitly determines the type of train), class of service (hard seat, hard sleeper, or soft sleeper etc), and date etc. Assuming that there are N ticket records to be merged, the direct merging method is described as follows. 70 Description of the direct merging algorithm Let initS be a set of N ticket records. Without loss of generality, we assume that each record only consists of origin and destination stations. Let mergeS stores the OD pairs after merging the records. We assume that each OD pair is comprised of three fields: origin, destination, and number of tickets. Firstly, empty the set mergeS , move the first record 1r of initS into 75 mergeS , name this record as OD pair 1h and set the field of number of tickets as 1. Secondly, compare the second record 2r of initS with 1h of mergeS . If the origin of 2r is the same of that of 1h and the destination of 2r is the same of that of 1h , merge 2r into 1h and increase the number of tickets of 1h by 1. If either the origins or destinations are different, move 2r into mergeS and set the number of tickets of 2h as 1. Continue moving and comparing records. When 80 it comes to the record kr , assuming there are already m OD pairs in mergeS , compare kr with the elements of mergeS successively. If the origin of kr is the same of that of ih in mergeS and the destination of kr is the same of that of ih , merge kr into ih and increase 1 to the number of tickets of ih . If either the origins or destinations are different for all m OD pairs in mergeS , add kr to mergeS and note it as 1mh ，set its field of number of tickets as 1. Repeat the above 85 operations until all records in initS are compared. The detailed algorithm is illustrated in the flow chart (Fig. 1). - 3 - 中国科技论文在线 START Empty set S merge i=0. Add N ticket records into set S init，k=1. Move the first record in S Init to S merge，k=1,m=1. Compare the kth record rk in S Init with the ith record hi in S merge Merge rk to hi ，k=k+1 END N Y N Y N Y Are all records in S init compared ? （>N?） Same orgin and same destination? i=i+1 Are all records in S merge compared? （>m?） Add rk to S merge，m=m+1，k=k+1 Fig 1. The flow chart of direct merging algorithm Complexity analysis 90 Assume there are M origin-destination pairs after merging N ticket records. Define the reduction ratio  as NM . Apparently, the record 1r does not need to be compared, so the computation time is 0. The last record Nr will be compared M times in the worst case, while it needs to be compared only once in the least case. Therefore, as an approximate estimation, the average times of comparison a record is about 4/)1( M , and dealing with N records needs 95 - 4 - 中国科技论文在线 a total comparison time of 4/)1( MN  , , the complexity 1 of the direct merging algorithm is 4/)1(1 MN  (1) We can infer from (1) that the computation time of this algorithm grows linearly with respect to the number of records when M is a constant. 100 2 Origin-associated merging method There will be  1 YY station pairs for a railway network with Y stations theoretically, but for a large-scale network, not all station pairs have traffic demands. For example, China railway network consists of 5,544 stations in 2007 with 520,000 freight OD pairs [6] . Each station has only OD pairs on average. In other words, for a certain freight origin, the number of 105 corresponding destinations is less than two percent of all stations. It is obvious that the railway OD matrix is highly sparse. Utilizing this characteristic, we can first compare the origins of two records, if their origins are same, compare their associated destinations. In this way the number of comparison times can be effectively reduced. Based on this idea, we propose a novel merging approach named origin-associated merging method. The algorithm is described as follows. 110 Description of the origin-associated merging algorithm Let F be a set of all origins. The size of the set is Y . Let )(iD be a set of all possible destinations of origin i . Let k iDn )( be a variable of the number of tickets whose origin is the thi element in F and whose destination is the thk element in )(iD . Firstly, empty the set F , copy the origin of the first record 1r of initS to F , copy its destination to )1(D and set 115 11 )1( Dn . When it comes to record kr in initS , assuming there are already fn origins in F , compare the origin of kr with that of all fn origins. If the origin of kr matches the thi element of F , compare the destination of kr with each elements of )(iD respectively. Stop comparing when the destination of kr matches the thh element in )(iD . Increase the value of h iDn )( by 1. If no matching element is found, add the destination of kr to )(iD and set 120 1 1)( )(  iD iDn . However, in the case where the origin of kr does not match any elements of F , add the origin of kr to F and increase the value of fn by 1. Meanwhile, add its destination to )( fnD . Repeat the above operations until all records of initS are compared. Detailed algorithm is shown in the following flow chart. - 5 - 中国科技论文在线 START Define set Sinit，move all remaining records to it，k=0 Let k=k+1，take the kth record rk in S init Is the origin of rk in F? N nf=nf+1，add this origin to F，define an empty set D(nf) Is the destination of rk in D(i) and match the hth element? Y N Are all records in S init compared? END N Y Y Add the destination of rk to D(nf)，and set the corresponding number of tickets 1 ( ) 1fD nn  Increase the number of tickets of the corresponding destination in D(i) by 1 ,., ( ) ( ) 1 h h D i D in n  Add the destination of rk to D(i)，and set ( ) 1 ( ) 1 D i D in   Define an empty set F，move the origin of the first record to F，nf=1；Define an empty set D(1)，move the origin and destination of the first record to D(1)，and set the corresponding number of tickets 1 (1) 1Dn  125 Fig 2. The flow chart of origin-associated merging algorithm Complexity analysis Skip the first record 1r . The origin of the last record Nr will compare Y times in the worst case, while it needs only one comparison in the least case. An average comparison time is the mid-value of the two extreme values, . 2/)1( Y . Consequently, the average comparison 130 time of the origin of all elements in initS is 4/)1( Y . The origins of all N records needs a comparison time of 4/)1( YN  . Now assume each origin is associated with D destinations. - 6 - 中国科技论文在线 Given a certain origin Fu , suppose there are uN records with origin u , among which the destination of the first record need not be compared; for the last record, its destination needs to be compared D times in the worst case, but only once in the least case. In this analogy, for any 135 record, the average comparison time should be the mid-value, that is 2/)1( D . Therefore, for all uN records, the average comparison time is 4/)1( D . We can infer that the complexity 2 of this method is  4/)1(4/)1(2 DYN  （2） Clearly, the relative efficiency  of the two algorithms is 140   M DY MN DYN      4/)1( 4/)1(4/)1( 1 2    （3） Suppose 000,000,6N (which is roughly equal to the number of tickets sold per day in China railway), 000,200M , 2000Y , and 100D , then %1 . Apparently, origin-associated merging method improves the merging efficiency significantly. 3 Numerical Analyses 145 We observed from our work supported by China Railways Corporation that, in recent years, freight stations with lower freight demands have been closed and the number of freight stations in operation has reduced to about 3,239. In addition, the number of OD pairs is 336,384, which suggests that each station is associated with about 104 OD pairs. As for rail passenger transportation, the number of stations with an departure or arrival volume over 10,000 passengers 150 per year is about 2,100. The number of the OD pairs of these stations is about 200,000. On average, each station is associated with about 95 OD pairs. In conclusion, the OD matrix of China railway network is highly sparse. For the convenience of analysis, we may neglect some stations whose transportation demands are low, so we can assume there are 2000 major passenger stations, labeled from 155 A0001～A2000. The origins of the records are generated randomly from A0001to A2000. In theory, for a certain station, there should be 1999 OD pairs. Without loss of generality, We assume that each origin is associated with 100 destinations. We also assume the destinations are randomly picked from 100 stations, labeled by B0001~B0100. In fact, B0001~ B0100 should be parts of A0001~ A2000. We separate the notations of origins and destinations for the convenience of 160 analysis, without loss of logical reasoning. According to the above hypothesis, we generate 30 samples. The numbers of records in these samples are 200,000, 400,000, 600,000, 800,000, ……, 6,000,000 respectively. Analysis of records reduction ratio We adopt direct merging and origin-associated merging approaches respectively to do 165 computations on the samples described above. All experimental analyses are done on a personal computer with Intel(R) Core(TM) i5 processor, CPU frequency at , and a memory of 4G bytes. The computational results are shown in table 1. Tab. 1 Major computational results No. N M Time_1/(s) Time_2/(s)  1 200000 126242 98 1 % 2 400000 172681 296 4 % - 7 - 中国科技论文在线 3 600000 189789 524 6 % 4 800000 196185 673 8 % 5 1000000 198557 872 10 % 6 1200000 199385 1054 11 % 7 1400000 199697 1239 14 % 8 1600000 199835 1449 15 % 9 1800000 199871 1639 18 % 10 2000000 199897 1843 20 % 11 2200000 199912 2057 22 % 12 2400000 199914 2303 24 % 13 2600000 199915 2510 26 % 14 2800000 199919 2671 29 % 15 3000000 199914 2926 32 % 16 3200000 199915 3082 35 % 17 3400000 199910 3336 37 % 18 3600000 199911 3540 39 % 19 3800000 199912 3715 39 % 20 4000000 199923 3864 42 % 21 4200000 199913 3987 42 % 22 4400000 199915 4222 44 % 23 4600000 199922 4410 46 % 24 4800000 199924 4577 47 % 25 5000000 199929 4820 49 % 26 5200000 199923 4977 51 % 27 5400000 199925 5251 52 % 28 5600000 199928 5622 55 % 29 5800000 199921 5824 55 % 30 6000000 199916 5906 59 % 170 The forth column of table 1 is the reduction ratio, which is calculated by the value of the third column over that of the second column, . NM Take the number of records as the horizontal axis, reduction ratio as the vertical axis, we can obtain the curve of reduction ratio shown in figure 3. - 8 - 中国科技论文在线 175 Fig. 3 Reduction ratio of railway passenger ticket records Figure 3 indicates the reduction ratio  becomes smaller and approaches 0 as the number of tickets grows. This feature will be more evident if we set the merged OD pairs as vertical axis shown in figure 4. It can be easily seen that as the number of records exceed 1 million, the corresponding value of the y-axis saturates at 200,000. 180 Fig. 4 Number of records before/after merging Calculation time analysis using direct merging We can draw the calculation time curve using the data in column 5 of table 1 (Fig. 5). The calculation time grows linearly as the number of records grows, which meets our theoretical 185 formulation 4/)1( MN  . Since the structure of passenger flow depends on the population distribution and passenger train plan in a region, M is approximately a constant within a given period. - 9 - 中国科技论文在线 Fig. 5 Computation time of direct merging 190 Calculation time analysis using origin-associated merging We can draw the calculation time curve using the data in column 6 of table 1 (Fig. 6). The calculation time also grows linearly as the number of records grows, which basically meets our theoretical formulation  4/)1(4/)1( DYN  . It will take about 1 second to merge every 100,000 records. For a give railway network and a stable population distribution, Y and D are 195 roughly constants. Of course, Y and D will change slowly as the scale of railway network expands and the incomes of residents grow. Fig. 6 Computation time of origin-associated merging Because the computation speed of the origin-associated merging is nearly 100 times faster 200 than direct merging, log scale representations can be more intuitive. Hence we plot a time curve with a log-scale y-axis in figure 7. - 10 - 中国科技论文在线 Fig. 7 Comparison between the computation times of two methods In the figure above, the red curve represents the direct merging method and the blue one is 205 the other method. It is not difficult to find that the vertical interval between the two curves is approximately constant. This further confirms the accuracy of (3). 4 Conclusion Based on the sparse matrix characteristic, this paper proposes the merging methods to generate railway OD pairs. The first method, namely the direct merging method, compares the 210 records origin by origin, then destination by destination. Then we refined the direct merging method considering the sparse characteristic of the OD matrix, and proposed a second method, namely the origin-associated merging method. Origin-associated merging significantly enhances the merging performance. The computation speed rises roughly 100 times compared to direct merging. The approach in this work issues from the demands of the optimization for railway 215 transportation operation and management. Nevertheless, the underlying method can also be applied in different transportation systems, such as highway transportation systems, urban transport management and traffic control, and urban rail transit. We believe that this method can achieve a similar good performance as long as the corresponding OD matrix has a highly sparse feature. Take the railway freight transportation as an example. A railway freight invoice record 220 includes information of the name of origin and destination, freight category (. coal, oil, grain, and ore), car type, volume, and cars needed etc. In railway network design problem, researchers only care about the volumes between two zones or stations. While for the design of a coal transportation rail line, besides the origin and destination, people also need the information of freight categories in the records. Moreover, in the researches of railway empty car distribution, the 225 interested fields of information are the origins, destinations and railcar types. In general, when dealing with the design of transportation network, traffic plan and etc., we frequently encounter the problems of merging huge quantities of passenger or freight transportation records according to different needs. The raw records often come in tens of thousands, and even in millions. Repeated mergence of data will consume a lot of time, so researchers are trying various techniques 230 to find efficient merging methods. Approaches to merging records according to the interested fields have become an important research topic in large-scale transportation systems. This work - 11 - 中国科技论文在线 can be used as technical support for further research works. Acknowledgements The author gratefully acknowledges PhD students Xuhong WEN and Lei CHEN from 235 Beijing Jiao Tong University for advice on the refinements and embellishments of figures in this work, and master student Jianping WU from Beijing Jiao Tong University for advice on the tests and computational analyses. References [1] YING H. The mergence and integration of catalog data in merged university libraries[J]. Journal of 240 academic libraries, 2004, 21(5) : 63-65 [2] CHEN M H. Preliminary Study on Patterns of Library Network Construction and Data Integration Since University Merging[J]. Journal of Library and Information Sciences in Agriculture, 2005, 17(3): 14-16 [3] ZHANG W H. Technologies of the mergence and integration of catalog data in the university libraries[J]. Journal of Library and Information Sciences in Agriculture, 2010 (4): 81-83 245 [4] LIN B L, WANG Z M, JI L J, et al. Optimizing the freight train connection service network of a large-scale rail system[J]. Transportation Research Part B: Methodological, 2012, 46(5): 649-667 [5] BARNHART C, JIN H, VANCE P H. Railroad blocking: a network design application[J]. Operations Research, 2000, 48 (4), 603-614. [6] MARTINELLI D R, TENG H. Optimization of railway operations using neural networks[J]. Transportation 250 Research Part C, 1996 4 (1), 33-49. [7] LIN B L, XU Z Y, HUANG M, GUO P W. An optimization model to railroad network designing[J]. Journal of the China Railway Society, 2002, 24 (2), 1-6. [8] KUBY M, XU Z Y, XIE X D. Railway network design with multiple project stages and time sequencing[J]. Journal of Geographical System, 2001, 3: 25-47. 255 [9] News China. New train schedule will come into effect from July, 2014. [OL]. [2014]. [10] China Railway Corporation. Cumulative completion of main indicators of national railway from January to December in 2013[OL]. [2014]. 260 基于稀疏矩阵特点的铁路原始客货流归并方法研究林睿熙1，林柏梁2 （1. 香港科技大学,电子与计算机工程系,香港,999077； 2. 北京交通大学，交通运输学院，北京，100044） 265 摘要：在铁路客货票数据大规模增长的背景下，本文提出了一种数据归并方法。OD （origin-destination pairs）数据是交通运输系统的基本数据，远期预测的 OD 数据是交通运输网络设计的基础，而当前的 OD 数据是设计各种优化运输方案的主要依据。由于在一个大规模的运输系统中，不仅仅 OD矩阵非常大，而且每个 OD 对本身就是大量的原始客货票据累加而成。例如，2013年中国铁路就发送了亿旅客，每张客票的数据包括了：发站、270 到站、车次、席位种类等信息。类似地，体现铁路货运 OD 的货票记录就包括了：发站、到站、货品名、使用货车种类、运输量等字段。海量的记录往往包含一些共同的字段信息，如两张不同的火车票可能有相同的发站、到站等。而在不同的应用研究中，往往关注的只是其中的部分字段。因此，如何针对有效字段进行铁路客货票据的合并，以形成所需要的 OD 数据，就成为铁路运输系统后续优化模型计算的重要前期工作。由于直接合并铁路客货票记录275 将产生大量的计算时间，数据处理产生的时间成本很大。本文针对这一现象，提出了一种基于稀疏 OD矩阵特征的关联归并法，有效地提高了原始 OD 记录的的合并效率。作为检验例子，本文根据中国铁路客流分布特征随机产生了一组数据，以 20 万条记录为步长，从 20 万到 600 万（相当于铁路一天的客票数量）共计 30 文件，采用直接归并法和关联归并法进行比较，对于 600万条客票记录的数据量，关联归并法时间消耗仅仅是直接归并法的百分之280 一，验证了本文所提方法的有效性和实用性。关键词：数据归并；铁路；OD对；稀疏矩阵；票据中图分类号：TP274

联系我们

智库文档公众号

客服微信

基于稀疏矩阵特点的铁路原始客货流归并方法研究.pdf

下载

标签

联系我们

意见反馈