Why is Python faster than C++ in this case?
A program in both Python and C++ is given below, which performs the following task: read white-space delimited words from stdin, print the unique words sorted by string length along with a count of each unique word to stdout. The format for a line of the output is: length, count, word.
For example, with this input file (488 kB of a thesaurus): http://pastebin.com/raw.php?i=NeUBQ22T
The output, with formatting, is this:
1 57 "
1 1 n
1 1 )
1 3 *
1 18 ,
1 7 -
1 1 R
1 13 .
1 2 1
1 1 S
1 5 2
1 1 3
1 2 4
1 2 &
1 91 %
1 1 5
1 1 6
1 1 7
1 1 8
1 2 9
1 16 ;
1 2 =
1 5 A
1 1 C
1 5 e
1 3 E
1 1 G
1 11 I
1 1 L
1 4 N
1 681 a
1 2 y
1 1 P
2 1 67
2 1 y;
2 1 P-
2 85 no
2 9 ne
2 779 of
2 1 n;
...
Here is the program in C++:
#include <algorithm>
#include <iterator>
#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

bool compare_strlen(const std::string &lhs, const std::string &rhs) {
    return (lhs.length() < rhs.length());
}

int main(int argc, char *argv[]) {
    std::string str;
    std::vector<std::string> words;

    /* Extract words from the input file, splitting on whitespace */
    while (std::cin >> str) {
        words.push_back(str);
    }

    /* Extract unique words and count the number of occurrences of each word */
    std::set<std::string> unique_words;
    std::map<std::string, int> word_count;

    for (std::vector<std::string>::iterator it = words.begin();
         it != words.end(); ++it) {
        unique_words.insert(*it);
        word_count[*it]++;
    }

    words.clear();
    std::copy(unique_words.begin(), unique_words.end(),
              std::back_inserter(words));

    // Sort by word length
    std::sort(words.begin(), words.end(), compare_strlen);

    // Print words with length and number of occurrences
    for (std::vector<std::string>::iterator it = words.begin();
         it != words.end(); ++it) {
        std::cout << it->length() << " " << word_count[*it] << " "
                  << *it << std::endl;
    }

    return 0;
}
Here is the program in Python:
import fileinput
from collections import defaultdict

words = set()
count = {}

for line in fileinput.input():
    line_words = line.split()
    for word in line_words:
        if word not in words:
            words.add(word)
            count[word] = 1
        else:
            count[word] += 1

words = list(words)
words.sort(key=len)

for word in words:
    print len(word), count[word], word
For the C++ program, the compiler used was g++ 4.9.0 with the -O3 flag.
The version of Python used was 2.7.3.
Time taken for the C++ program:
time ./main > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt
real 0m0.687s
user 0m0.559s
sys 0m0.123s
Time taken for the Python program:
time python main.py > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt
real 0m0.369s
user 0m0.308s
sys 0m0.029s
The Python program is much faster than the C++ program, and its advantage grows with larger input sizes. What's going on here? Is my use of the C++ STL incorrect?
Edit: As suggested by a comment and an answer, I changed the C++ program to use std::unordered_set and std::unordered_map.
The following lines were changed:
#include <unordered_set>
#include <unordered_map>
...
std::unordered_set<std::string> unique_words;
std::unordered_map<std::string,int> word_count;
The compilation command:
g++-4.9 -std=c++11 -O3 -o main main.cpp
This improved performance only marginally:
time ./main > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt
real 0m0.604s
user 0m0.479s
sys 0m0.122s
Edit2: A much faster program in C++
This is a combination of NetVipeC's solution, Dieter Lücking's solution, and the top answer to this question. The real performance killer was std::cin being synchronized with C stdio by default, which effectively makes the reads unbuffered; this was solved with std::cin.sync_with_stdio(false);. This solution also uses a single container, taking advantage of the ordered map in C++.
#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

struct comparer_strlen {
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        if (lhs.length() == rhs.length())
            return lhs < rhs;
        return lhs.length() < rhs.length();
    }
};

int main(int argc, char* argv[]) {
    std::cin.sync_with_stdio(false);

    std::string str;

    typedef std::map<std::string, int, comparer_strlen> word_count_t;

    /* Extract words from the input file, splitting on whitespace */
    /* Extract unique words and count the number of occurrences of each word */
    word_count_t word_count;
    while (std::cin >> str) {
        word_count[str]++;
    }

    // Print words with length and number of occurrences
    for (word_count_t::iterator it = word_count.begin();
         it != word_count.end(); ++it) {
        std::cout << it->first.length() << " " << it->second << " "
                  << it->first << '\n';
    }
    return 0;
}
Runtime
time ./main3 > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txt
real 0m0.106s
user 0m0.091s
sys 0m0.012s
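To see how much of the gap comes from stream synchronization alone, one can time a loop that does nothing but read words, with synchronization toggled by a command-line flag. The sketch below is only illustrative and not part of the original post; the nosync flag and binary name are arbitrary.
#include <iostream>
#include <string>

// Reads whitespace-delimited words from stdin and counts them, optionally
// detaching the C++ streams from C stdio first. Compare, for example:
//   time ./read_bench        < thesaurus.txt
//   time ./read_bench nosync < thesaurus.txt
int main(int argc, char* argv[]) {
    if (argc > 1 && std::string(argv[1]) == "nosync")
        std::cin.sync_with_stdio(false);

    std::string str;
    long n = 0;
    while (std::cin >> str)
        ++n;

    std::cout << n << " words read" << '\n';
    return 0;
}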
Edit3: A nice and concise version of the Python program was provided by Daniel; it runs in about the same time as the original Python version above:
import fileinput
from collections import Counter

count = Counter(w for line in fileinput.input() for w in line.split())
for word in sorted(count, key=len):
    print len(word), count[word], word
Runtime:
time python main2.py > measure-and-count.txt.py < ~/Documents/thesaurus/thesaurus.txt
real 0m0.342s
user 0m0.312s
sys 0m0.027s
Test with this; it should be faster than the original C++.
Changes are:
- Eliminated the vector words used to save the words (they are already saved in word_count).
- Eliminated the set unique_words (word_count contains only the unique words).
- Eliminated the sort of the words (the ordering is maintained by the map: words are now ordered by length, and words of the same length are ordered lexicographically).
#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

struct comparer_strlen_functor {
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        if (lhs.length() == rhs.length())
            return lhs < rhs;
        return lhs.length() < rhs.length();
    }
};

int main(int argc, char* argv[]) {
    std::cin.sync_with_stdio(false);

    std::string str;

    typedef std::map<std::string, int, comparer_strlen_functor> word_count_t;

    /* Extract words from the input file, splitting on whitespace */
    /* Extract unique words and count the number of occurrences of each word */
    word_count_t word_count;
    while (std::cin >> str) {
        word_count[str]++;
    }

    // Print words with length and number of occurrences
    for (word_count_t::iterator it = word_count.begin(); it != word_count.end();
         ++it) {
        std::cout << it->first.length() << " " << it->second << " " << it->first
                  << "\n";
    }
    return 0;
}
New version of the reading loop, which reads line by line and splits each line. It needs #include <boost/algorithm/string/split.hpp>:
while (std::getline(std::cin, str)) {
    for (string_split_iterator It = boost::make_split_iterator(
             str, boost::first_finder(" ", boost::is_iequal()));
         It != string_split_iterator(); ++It) {
        if (It->end() - It->begin() != 0)
            word_count[boost::copy_range<std::string>(*It)]++;
    }
}
Tested on a Core i5 with 8 GB RAM, GCC 4.9.0, 32-bit: it runs in 238 ms. The code was updated with std::cin.sync_with_stdio(false); and '\n' as proposed.
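For reference, a self-contained version of this Boost-based variant might look as follows. The string_split_iterator typedef and the use of the all-in-one <boost/algorithm/string.hpp> header are assumptions made for this sketch; the answer above only shows the loop itself and names <boost/algorithm/string/split.hpp>.
#include <iostream>
#include <map>
#include <string>
#include <boost/algorithm/string.hpp>      // split_iterator, first_finder, is_iequal
#include <boost/range/iterator_range.hpp>  // copy_range

// Orders map keys by length, then lexicographically, as in the answer above.
struct comparer_strlen {
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        if (lhs.length() == rhs.length())
            return lhs < rhs;
        return lhs.length() < rhs.length();
    }
};

typedef boost::split_iterator<std::string::iterator> string_split_iterator;

int main() {
    std::cin.sync_with_stdio(false);

    typedef std::map<std::string, int, comparer_strlen> word_count_t;
    word_count_t word_count;

    // Read line by line and split each line on spaces.
    std::string str;
    while (std::getline(std::cin, str)) {
        for (string_split_iterator it = boost::make_split_iterator(
                 str, boost::first_finder(" ", boost::is_iequal()));
             it != string_split_iterator(); ++it) {
            if (it->end() - it->begin() != 0)  // skip empty tokens
                word_count[boost::copy_range<std::string>(*it)]++;
        }
    }

    // Print length, count, word for each unique word.
    for (word_count_t::iterator it = word_count.begin();
         it != word_count.end(); ++it) {
        std::cout << it->first.length() << " " << it->second << " "
                  << it->first << '\n';
    }
    return 0;
}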
Making three changes: omitting the extra vector (which you do not have in Python), reserving memory for the word vector, and avoiding std::endl (!) in the output:
#include <algorithm>
#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>

bool compare_strlen(const std::string &lhs, const std::string &rhs) {
    return (lhs.length() < rhs.length());
}

int main(int argc, char *argv[]) {
    /* Extract words from the input file, splitting on whitespace */
    /* Also count the number of occurrences of each word */
    std::map<std::string, int> word_count;
    {
        std::string str;
        while (std::cin >> str) {
            ++word_count[str];
        }
    }

    std::vector<std::string> words;
    words.reserve(word_count.size());
    for (std::map<std::string, int>::const_iterator w = word_count.begin();
         w != word_count.end();
         ++w)
    {
        words.push_back(w->first);
    }

    // Sort by word length
    std::sort(words.begin(), words.end(), compare_strlen);

    // Print words with length and number of occurrences
    for (std::vector<std::string>::iterator it = words.begin();
         it != words.end();
         ++it)
    {
        std::cout << it->length() << " " << word_count[*it] << " "
                  << *it << '\n';
    }

    return 0;
}
Gives:
Original:
real 0m0.230s
user 0m0.224s
sys 0m0.004s
Improved:
real 0m0.140s
user 0m0.132s
sys 0m0.004s
More improved by adding std::cin.sync_with_stdio(false); (see OregonTrail's question):
real 0m0.107s
user 0m0.100s
sys 0m0.004s
And NetVipeC's solution with std::cin.sync_with_stdio(false); and std::endl replaced by '\n':
real 0m0.077s
user 0m0.072s
sys 0m0.004s
Python:
real 0m0.146s
user 0m0.136s
sys 0m0.008s
std::vector<std::string> words;
/* Extract words from the input file, splitting on whitespace */
while (std::cin >> str) {
    words.push_back(str);
}
Every time the vector outgrows its current capacity, this has to allocate a larger buffer, copy the existing elements into it, and free the old storage. Either pre-allocate the vector or use something like a list.
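A minimal sketch of the pre-allocation suggestion; the 100000-element estimate is an arbitrary guess for this kind of input, not a measured figure:
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words;
    // Reserve an estimated capacity up front so push_back does not have to
    // grow (allocate, copy, free) the buffer repeatedly as words are added.
    words.reserve(100000);

    std::string str;
    while (std::cin >> str)
        words.push_back(str);

    std::cout << words.size() << " words, capacity " << words.capacity() << '\n';
    return 0;
}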