Regex - Reading CSV files like MS Excel does

The expression regex e(",") handles files in a clean format such as:
field 1,field 2,field 3
I would like to read some CSV files where there may be fields that contain a comma (enclosed in ""):
1999,Smith, Mike, "Smith, Mike", 55
1999,Doe, Jane, "Doe, Jane", 45
And if possible, I would like to handle commas and quotes within the field:
1999,Doe, Jane, "Doe, Jane "Happy Gurl"", 45
Can someone help me with this please? Thank you very much

Googling "regex csv" yields several hits, the first of which is:
http://geekswithblogs.net/mwatson/archive/2004/09/04/10658.aspx
Jeff

I've been looking at those googled expressions and they don't seem to handle
the CSV file the way they should.
I'll continue experimenting or use Boost.Tokenizer for this purpose as
suggested by Eric Malenfant.
Thanks

I experimented with the tokenizer at one time. The escaped tokenizer worked
after a fashion. If you don't have empty elements
(e.g. ...,item1,,item3,...) it is not too bad. However, if you define
    // CharTokens and EscapedTokenizer: typedefs whose definitions are not shown here
    CharTokens cs(boost::keep_empty_tokens);
    EscapedTokenizer et(buffer, cs);
you can get the empty tokens. The downside, as I recall, is that you also get
the separators, but you can skip over them as you iterate over the
result. I think if you iterate with EscapedTokenizer::iterator you will
still get the surrounding quotes for any item that is quoted.
Another heavyweight scheme is using Spirit. I found an early version of a
test that I think worked although I would be hard-pressed at the moment to
explain it.
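    // Assumes classic Spirit (namespace boost::spirit) and namespace std are in scope.
    // Example string for parsing
    string buffer("\"string\",\"string with an embedded \\\"\",123,0.123,,2");
    rule<> list_csv, list_csv_item;
    // vec_item receives each matched item; vec_list receives the text matched by the whole list
    vector<string> vec_list, vec_item;
    // An item is either a quoted string with C-style escapes or a number
    list_csv_item =
            confix_p('\"', *c_escape_ch_p, '\"')
        |   longest_d[real_p | int_p]
        ;
    // Items are optional and separated by commas
    list_csv = list_p(
            !list_csv_item[append(vec_item)],
            ','
        )[append(vec_list)]
        ;
    // rule<> scans char const*, so pass a C string rather than the std::string itself
    parse_info<> result = parse(buffer.c_str(), list_csv);
    if (result.hit) {
        // Got something
        if (result.full) {
            // Complete
        } else {
            // Not quite everything
        }
        // iterate through vec_list or vec_item for data
    } else {
        // Didn't parse
    }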
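With very simple CSV files you could use split() from the Boost string algorithms library (string_algo).
    // needs <boost/algorithm/string.hpp>; split and is_any_of live in namespace boost
    vector<string> sv;
    split(sv, buffer, is_any_of(","));
You end up with a vector of the elements. An empty field shows up as an empty string in the
vector. Separators embedded in quoted fields won't work in this scheme, as I recall.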
There may be easier schemes. I don't recall seeing a regex scheme. Spirit I
think actually uses the Tokenizers under the covers and can use regex in
some cases.
Larry
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

Jeff Dunlap, on 13 November 2008 at 23:18:
The expression regex e(",") handles files in a clean format such as:
field 1,field 2,field 3
I would like to read some CSV files where there may be fields that contain a comma (enclosed in ""):
1999,Smith, Mike, "Smith, Mike", 55
1999,Doe, Jane, "Doe, Jane", 45

escaped_list_separator in Boost.Tokenizer seems to do what you need.
Éric Malenfant

Thanks for the info. I was really hoping to use Regex, and if I can't figure
it out, I'll definitely use Tokenizer for this purpose.
participants (4)
- Eric MALENFANT
- Jeff Dunlap
- Jeff Flinn
- Larry