- 1 -
中国科技论文在线
Deriving OD matrix from Railway ticket records based on
merge sort method#
LIN Ruixi
1
, Wu Jianping
2
, Wang Jiaxi
2
, LIN Boliang
2**
5
(1. Department of Electronics and Computer Engineering, The Hong Kong University of Science
and Technology, Hong Kong, 999077;
2. School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China)
Foundations: National Natural Science Foundation of China(No. 51178031); project of China Railways
Corporation( 2013X005-A)
Brief author introduction:LIN Ruixi(1993-),Female,Undergraduate student,Information Iechnology,IC Design
Correspondance author: LIN Boliang(1961-),Male,Professor,Railway Operation Management,Transportation
Systems Network Design,Transportation and Logistics,Intelligent Transportation System. E
Abstract: In a transportation system, OD pairs are usually accumulated by large quantities of primary
records. The records may come from train tickets, freight invoices, and flight trip tickets etc. In 2013, 10
billion passengers are transported by China railway. Each train ticket contains information of
origin and destination stations, train number, and train class etc. Only several fields of information,
such as origin and destination, are required according to specific research purposes. The rest of the
fields are unnecessary and need to be merged. A direct merging method in which each record compares
with all existing OD pairs will consume a lot of time when there are large quantities of records. This 15
work will propose a sorting-based merging method, which applies to general OD matrices. The time
complexity is scale of O(NlogN) . For tests, 30 samples are used for numerical analyses. The results
show that the efficiency is far better than that of the direct merging method.
Key words: OD matrix, Railway ticket, merging, sorting.
20
0 Introduction
Origin-destination (OD) pairs are fundamental data in the field of transportation, and are
widely used in various applications. For example, predictive OD matrices of the long-term future
are the basis of optimizing transportation network design, while historical OD matrices usually
apply to design optimal transportation plans. The size of an OD matrix is always huge in a real 25
transportation system. Each element of the matrix, . an OD pair, is accumulated by numerous
passenger tickets or freight invoices. For a typical transportation system, one of the fundamental
problems is to efficiently integrate the records into a simple OD matrix.
At present, there are over 100 thousands kilometers and thousands of stations in China
railway systems. The length of high-speed railway is over 11,000 kilometers. There have been 30
4894 passenger trains including 2660 high-speed trains(EMU) in China railway system since July
2014, and according to statistics, in 2013, billion passengers travelled by China railway
[1]
.
The average daily passenger traffic volume reaches million. Each freight station is associated
with about 104 OD pairs, and most passenger stations are associated with 95 OD pairs on
average
[1]
. It is clear that the OD matrix of China rail network is quite sparse. Based on this 35
characteristic, the author of [1] proposed an origin-associated merging method to integrate the
primary ticket records effectively. However, the algorithm results in low efficiency where the OD
matrix does not show a sparse characteristic. To deal with more general cases, this work proposes
a merging method based on a sorting algorithm. The proposed method sorts the primary ticket
records by origins, then sorts by destinations, and finally merges the sorted tickets. The efficiency 40
of this method mainly depends on the sorting algorithm.
Typical sort algorithms include bubble sort, selection sort, insert sort, shell sort, quick sort,
merge sort and heap sort
[2~8]
. Bubble sort and selection sort are stable but only works well for
- 2 -
中国科技论文在线
short sequences of records. Compared to merge sort, the time complexity of insert sort is higher,
while shell, quick, and heap sort are unstable. The sequences of ticket records in this work are 45
large, hence we adopts the comparatively stable and efficient merge sort to preprocess our data.
The rest of the paper is organized as follows: Section 1 describes two existing merging
methods. In Section 2, we give a simple illustration on the procedures of merge sort. A detailed
description of the merging method based on sort is given in section 3. Computational results and
conclusions are presented in section 4 and 5. 50
1 Direct merging and origin-associated merging method
Description of direct merging algorithm
Let
initS be a set of N ticket records. Without loss of generality, we assume that each
record only consists of origin and destination stations. Let
mergeS stores the OD pairs after
merging the records. Move the first record in
initS to mergeS . During each comparison, compare 55
the origin and destination of a record in
initS with the origins and destinations of all records
in
mergeS . If any two records contain the same origin and same destination, merge the record
in
initS into the existing record in mergeS ; otherwise add that record into mergeS . Repeat the above
step until all records in
initS are compared. Assume there are M origin-destination pairs after
merging N ticket records. The complexity of this algorithm is 4/)1( MN [1]. 60
Origin-associated merging method
Considering that the real railway OD matrix is highly sparse, the author of [1] proposes a
novel merging approach named origin-associated merging method. During each comparison,first
compare the origins of two records, if their origins are same, compare their associated destinations.
In this way the number of comparison times can be effectively reduced. The complexity of the 65
algorithm is 4/)1(4/)1( DYN , where Y is number of all origins of records and D is
the number of destinations that each origin is associated with.
2 The Merge Sort Algorithm
The merge sort (MS) algorithm is an efficient sorting algorithm built on the basis of merging
operations. This algorithm exemplifies the divide-and-conquer algorithms. Suppose the original 70
sequence contains N records, denoted by )(NS . We can divide this sequence into two
subsequences, and again divide each subsequence into two subsequences. Repeat dividing until
)(NS is divided into N subsequences. Eventually, each subsequence will contain one element.
Then merge every two subsequences until the whole sequence is sorted. The nature of this
algorithm is dichotomy. Besides dichotomy, trichotomy or even K dividing method can be 75
utilized. In general cases, dichotomy works well. Considering that the complexity of segmentation
methods will grow rapidly as the number of segments increase, this work adopts the principal of
dichotomy in the proposed method. Using dichotomy
[9]
, we segment )(NS in the following
manner:
- 3 -
中国科技论文在线
80
Fig. 1 The operation of merge sort on the array )(NS
After division, it will become a tree with N elements in each layer. For two sorted
sequences )(nA and )(mB , the time consumed in merge-sorting them is nm , which is in the
complexity scale of )(nO . Therefore, the time used for each layer is in the scale of )(nO .
In addition, the depth of the tree is Nlog . Multiplying the time for each layer with the depth, we 85
can obtain the total time spent on the sort process is )log()(*log)( NNONONNT .
The detailed MS algorithm is described by the following steps:
Step 1: Divide the sequence of length N into two subsequences of length 2/N .
Step 2: Merge sort the two subsequences respectively.
Step 3: Combine the two sorted subsequences into a single sorted sequence. 90
It is not hard to find that the core of this algorithm is the merging process, which is a recursive
process.
3 The sort merge method
Description of the sort merge method
Let initS be a set of N ticket records. Define B as a temporary set which stores the 95
records with the same origin. And let C stores the OD pairs after merging the records..
Firstly, empty the set initS , B and C . Add all records into initS and sort them according
to the origin by MS algorithm. Secondly, move the first record of initS into B . For the kth
record
init
k Sr , if the origin are same with the first record in set B , add this record into B .
Otherwise, let m be the number of all elements in set C . For each element Ch j , let jn 100
denote the ticket number , move the first record of B into C . Let 1 mm and 1mn .
For the ith record in set B , if the destination of this record is the same with the mth record
of C , 1 mm nn . Otherwise, move the ith record into C , and let 1 mm and 1mn .
Repeat this operation until all elements in B are compared. The detailed algorithm is illustrated in
the flow chart. 105
S(N)
S[1:N/2]
S[1:N1/2] S[N1/2+1:N2]
S[1:N2/2] S[N2/2+1:N3] S[1:N2/2]
S[N2/2+1:N3]
S[N/2+1:N]
S[1:N1/2]
S[N1/2+1:N2]
S[1:N2/2]
S[N2/2+1:N3]
S[1:N2/2]
S[N2/2+1:N3]
……
S[1:1] S[1:1] …
N1=N/2
N2=N/4
N3=N/8
Nj=N/2
j
Nk=1,k= log2N
- 4 -
中国科技论文在线
Flow chart of sort merge method
In figure 2, the two blue diamonds with MS notation represent two subprograms. The square in
pink denoted by Merge the record in set B and move the result to set C is another subprogram 110
shown in figure 3.
Y
Y
N
N
N
Sort N records of S
init
M
S
Define set B and move inti1 Sr into B . Let 1k
k=k+1
? Nk
M
S
Add all N ticket records into set intiS
Move kr into B
Y Are the origin of
kr
and 1kr the same ?
END
START
Sort all records of B
according to the destinations.
Merge the record in set B
and move the result to set C
? Nk Empty B
- 5 -
中国科技论文在线
Subprogram of sorting by destinations and merging
Complexity Analysis 115
After we have sorted the all N records by origins, assume there are m origins whose
names are independent, with each origin associated with mNNN ,,, 21 records. The number
of all records can be expressed as
m
i
iNN
1
. To sort the iN records of the ith origin by
destinations, the computational time will be ii NN log . The time used to sort the m sequences
1 mm
Move Bri into C , incorporate the ticket number field
into ir , denote the new record as mh and let 1mn .
Y
Are the destination of
ir and 1ir the same ?
Let 1i . Let m be the number of all elements in set C .
For each element Ch j , let jn denote the ticket number
of jh .
START
N
Y
END
1 mm nn ; 1 ii
N
Bi
? 2B
Y
N
- 6 -
中国科技论文在线
by destinations will be: 120
mm NNNNNN logloglog 2211 (1)
Set mNNNI ,,,max 21max , then we can obtain
max2211 loglogloglog INNNNNNN mm (2)
In this way, the number of comparisons in merging is only 1N when the sequence has been
sorted by origins and by destinations. Since NI max , the sort merging method leads to a time 125
complexity of )log( NNO .
4 Algorithm testing and computational results
Complexity Analysis
For the convenience of comparison, we also use the samples from [1] to measure the
efficiency of the proposed method in this work. The author of [1] uses 30 samples of ticket records 130
for computation, all of which are corresponding with sparse OD matrices.
The computation results of our method are shown in table 1.
Tab. 1. Computational results for sparse OD matrices instances
No. N M
1T (sec) 2T (sec) 3T (sec) 12 /TT 13 /TT 23 /TT
1 200000 126242 98 1 1 % % %
2 400000 172681 296 4 2 % % %
3 600000 189789 524 6 3 % % %
4 800000 196185 673 8 4 % % %
5 1000000 198557 872 10 5 % % %
6 1200000 199385 1054 11 6 % % %
7 1400000 199697 1239 14 7 % % %
8 1600000 199835 1449 15 8 % % %
9 1800000 199871 1639 18 9 % % %
10 2000000 199897 1843 20 10 % % %
11 2200000 199912 2057 22 10 % % %
12 2400000 199914 2303 24 12 % % %
13 2600000 199915 2510 26 13 % % %
14 2800000 199919 2671 29 15 % % %
15 3000000 199914 2926 32 16 % % %
16 3200000 199915 3082 35 17 % % %
17 3400000 199910 3336 37 18 % % %
18 3600000 199911 3540 39 19 % % %
19 3800000 199912 3715 39 20 % % %
- 7 -
中国科技论文在线
20 4000000 199923 3864 42 21 % % %
21 4200000 199913 3987 42 23 % % %
22 4400000 199915 4222 44 23 % % %
23 4600000 199922 4410 46 24 % % %
24 4800000 199924 4577 47 25 % % %
25 5000000 199929 4820 49 27 % % %
26 5200000 199923 4977 51 28 % % %
27 5400000 199925 5251 52 29 % % %
28 5600000 199928 5622 55 30 % % %
29 5800000 199921 5824 55 31 % % %
30 6000000 199916 5906 59 32 % % %
In table 1, the first column represents the sample indices, the second column N is the number 135
of tickets in each sample. The step size of the sample number is 200,000 items. The 30
th
sample
contains 6 million tickets. The third column M is the number of OD pairs after merging. The
forth column is the reduction ratio, . NM / . 1T , 2T 和 3T represents the computation
time of the direct merging method, origin-associated merging and sort merging method. The last
three columns are the ratios of the computational times between them. The last column shows that 140
the sort merging method and the origin-associated merging method works competitively, the
merging time of sort merging is about half of that of origin-associated merging, . 23 TT .
The time curves of the two methods are illustrated in figure 4.
Computational time of two merging methods for sparse OD matrices 145
Complexity Analysis
To test the performance in non-sparse matrices, we assume again that there are 2000 stations
in the transportation network, denoted by A0001~A2000. The origins of the ticket records are
- 8 -
中国科技论文在线
randomly generated from A0001~A2000. Contrast to [1], the destinations range from 2000
stations labelled by A0001~A2000 instead of only 100 stations. Records with identical origin and 150
destination will be excluded. The programs run on the same personal computer with Intel(R)
Core(TM)i5 processor as in [1]. The computational results are shown in table 2.
Tab. 2. Computational results for non-sparse OD matrices instances
No. N M
2T (sec) 3T (sec) 23 /TT
1 200000 195017 3 1 %
2 400000 380545 4 3 %
3 600000 557472 10 4 %
4 800000 725347 12 5 %
5 1000000 884467 18 7 %
6 1200000 1036824 22 7 %
7 1400000 1181961 26 8 %
8 1600000 1318598 33 10 %
9 1800000 1449510 40 10 %
10 2000000 1573798 45 11 %
11 2200000 1691447 53 13 %
12 2400000 1804570 63 14 %
13 2600000 1911197 71 15 %
14 2800000 2012880 78 15 %
15 3000000 2110633 88 18 %
16 3200000 2202549 92 17 %
17 3400000 2290739 102 18 %
18 3600000 2373181 112 21 %
19 3800000 2451604 117 21 %
20 4000000 2528324 128 22 %
21 4200000 2599760 130 25 %
22 4400000 2668570 136 26 %
23 4600000 2733313 143 27 %
24 4800000 2793936 148 27 %
25 5000000 2853626 158 28 %
26 5200000 2909563 168 30 %
27 5400000 2962193 177 32 %
28 5600000 3012856 187 32 %
29 5800000 3060224 193 33 %
- 9 -
中国科技论文在线
30 6000000 3106888 202 34 %
Theoretically, for a transportation network of 2000 stations, the possible number of OD pairs
is 2000×1999 (around 4 million), while the randomly generated 6 million records lead to about 3 155
million merged OD pairs. As the sample size enlarges, the number of OD pairs approaches 4
million. Table 2 shows that the reduction ratio can reach 52% for non-sparse OD matrices, given
that any two origin and destination pair has equal probable transportation demands. The curve in
figure 5 is in line with this asymptotic tendency.
160
Reduction ratio for non-sparse OD matrices instances
In addition, table 2 also indicates that the sort merging method have higher performances
than the other method for non-sparse OD matrices. Under the situation of 6 million samples, the
time using sort merging is only around 1/6 of that of the origin-associated merging. The time
curves of the two methods are shown in figure 6. 165
- 10 -
中国科技论文在线
Computational time of two merging methods for non-sparse OD matrices
From table 1 and table 2, it can be concluded that sort merging is insensitive to the sparseness
of an OD matrix. This feature can be easily found in figure 7.
170
Fig. 7 The time characteristics of sort merging for sparse and non-sparse OD matrices
The pink curve in figure 7 represents the time performance for non-sparse OD matrices, and
the blue curve represents that of the sparse OD matrices. Both curves are approximately linear
with a slope of . The pink curve almost coincide with the blue one, which indicates that the
performance of sort merging is not likely to be affected by the sparseness of the samples. In 175
contrast to sort merging is the origin-associated merging, whose performance is highly sensitive to
the sparseness of the samples. This feature of origin-associated merging can be seen from figure 8.
- 11 -
中国科技论文在线
Fig. 8 The time characteristics of origin-associated merging for sparse and non-sparse OD matrices
Figure 8 displays time characteristics of origin-associated merging under both situations of 180
sparse and non-sparse OD matrices samples. The pink and blue curve denote the sparse and
non-sparse cases respectively. It is clearly seen that the origin-associated merging efficiency drops
as the number of records in non-sparse matrices samples grows. For the last sample with as many
as 6 million records, the computational time is 59 seconds for sparse OD matrices samples (see
2T in table 1), while it costs 202 seconds for non-sparse OD matrices samples (see 2T in table 185
2). The consumed time grows more than three times.
5 Conclusion
This work proposes a sort merging method for OD data integration, which sorts all primary
ticket records by origins, then sort all subsets of records with the same origin by destinations,
finally merge the records with the same origin and same destination. The merging efficiency 190
improves significantly using this method. It achieves the same effiecny as origin-associated
merging when dealing with sparse OD matrices, and outperforms the latter method greatly when
the OD matrices are not sparse enough.
Acknowledgements (Optional)
The authors gratefully thank PhD students Xuhong Wen, Lei Chen and Jian Li Beijing Jiao 195
Tong University, for advice on the refinements and embellishments of figures in this work, and
suggestions on the tests and computational analyses respectively.
References
[1] LIN Ruixi,LIN Boliang. An approach to merging railway ticket records into origin-destination pairs based on
sparse matrix characteristic [OL]. [2014-08-25]. China Sciencepaper Online. 200
[2] Oyelaml Olufemi Moses. Improving the performance of bubble sort using a modified diminishing increment
sorting [J]. Scientific Research and , 4(8):740-744.
[3] Iraj H., Afsari M. H. S., Hassanzadeh S. A New External Sorting Algorithm with Selecting the Record List
Location[J]. USEAS Transactions on Communications. 2006, 5(5):909-913. 205
[4] Srirupesh,Ttonhunt,Sweetesh. Bidirectional Expansion - Insertion Algorithm for Sorting[C]. Second
International Conference on Emerging Trends in Engineering and Technology, 2009, 59-62.
- 12 -
中国科技论文在线
[5] , . Stochastic analysis of shell sort[J].,31:442-457.
[6] Wang Xiang. Analysis of the time complexity of quick sort algorithm. Information Management, Innovation
Management and Industrial Engineering (ICIII), 2011 ,408-410. 210
[7] Jafarlou M. Z., Fard P. Y. Heuristic and Pattern Based Merge Sort. Procedia Computer Science. 2011, 3:
322-324.
[8] Xu Xiu. Distributed computing systems using heap sort of multi-layer sub-tree by task match scheduling
algorithm. 3rd International Symposium on Parallel Architectures, Algorithms and ,321-325.
[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to Algorithms (3rd 215
edition) . The MIT Press, 2009.
基于归并排序法的铁路OD矩阵形成方法研220
究
林睿熙1,武建平2,王家喜2,林柏梁2
(1. 香港科技大学,电子与计算机工程系,香港,999077;
2. 北京交通大学,交通运输学院,北京,100044)
摘要:在一个运输系统中,OD 对的形成通常是通过大量的原始记录累积而成,原始记录可225
能是火车票、货运单据、飞机票等等。例如,2013年中国铁路系统发送了 亿人次的旅
客。每张火车票包含了发站、到站、车次、席别等字段信息。而不同的研究目标,需要关注
的字段是不同的。如果我们的某项研究只需要关注其中的发到站,则对其余字段就需要进行
合并处理。把每张票和已有的 OD对进行发到站逐一比对的直接归并法将耗费大量的计算时
间。本文提出了一种基于排序法的 OD 矩阵形成技术,其时间复杂度为 O(NlogN)。作为实230
验,我们计算了 30 个样本进行数据分析,结果显示本文所提方法比直接合并法的效率有明
显的提高。
关键词:OD 矩阵,铁路运输票据,归并,排序.
中图分类号:TP274
235