Clang: Open-source C/C++ front end under development

Hello fellow Boosters,

The need for an extensible, complete open-source C++ parser has been discussed on Boost over the years. There are myriad projects that parse approximations to C++ suitable for specific tasks (e.g., documentation generation), but there has never been a community project meant to handle all of C++ or serve as a common platform for new C++ tools. That is going to change, and I hope Boost will be involved.

Apple has begun development of "Clang", which aims to be a high-quality C/C++ parser and compiler. Clang is designed as a library for C++ tools, from IDE-supported tools like indexing, searching, and refactoring to documentation tools, static checkers, and a full C++ front end to the Low Level Virtual Machine (LLVM). LLVM is an advanced, open-source compiler infrastructure supporting all of the back-end functionality needed for a C++ compiler. Like LLVM, Clang is written in a modern C++ style, with a clean separation between the different phases of translation in C++ (lexical analysis, parsing, semantic analysis, lowering, code generation, etc.), all of which make Clang a pleasure to work with.

Clang is still in the early stages of development. At present, it parses and type-checks most of C, with code generation for some simple constructs. C++ support is not available yet, but Apple has dedicated significant resources to making Clang a successful C++ parser and compiler.

That said, Clang needs our help. Collectively, Boost developers have more knowledge about and a deeper understanding of C++ than any other community within C++. We have the C++ expertise to help implement the language well, and the library-design savvy to help build powerful, accessible interfaces that simplify the task of building new and improved tools for C++. The Clang team is asking for help implementing C and C++ features and is very open to new contributors.

Clang has the potential to be a top-notch, open-source C++ compiler and tool platform, from which we would all benefit. Let's help Clang get there sooner!

Links:
Clang: http://clang.llvm.org/
LLVM: http://llvm.org/
Steve Naroff's talk motivating Clang:
- Video: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.mov
- Slides: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.pdf
Clang announcement: http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html

Cheers,
Doug

Doug Gregor wrote:
[snip]
Sounds promising, but here:

[root@maple clang]# make -j3
Makefile:4: ../../Makefile.common: No such file or directory
make: *** No rule to make target `../../Makefile.common'. Stop.
[root@maple clang]# ls
AST  clang.xcodeproj  docs  include  Lex  Makefile  NOTES.txt  README.txt  test  www
Basic  CodeGen  Driver  INPUTS  LICENSE.TXT  ModuleInfo.txt  Parse  Sema  TODO.txt
[root@maple clang]# find -name Makefile.common
[root@maple clang]#

On Aug 31, 2007, at 11:06 AM, Fei Liu wrote:
Sounds promising, but here:

[root@maple clang]# make -j3
Makefile:4: ../../Makefile.common: No such file or directory
make: *** No rule to make target `../../Makefile.common'. Stop.

The Clang web page (http://clang.llvm.org/) describes how to build Clang:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html

Please note also that they aren't at the point where bug reports are useful by themselves, because they know they have not covered all of C or C++. Bug reports with patches, of course, are very useful :)

- Doug

Doug Gregor wrote:
On Aug 31, 2007, at 11:06 AM, Fei Liu wrote:
Sounds promising, but here:

[root@maple clang]# make -j3
Makefile:4: ../../Makefile.common: No such file or directory
make: *** No rule to make target `../../Makefile.common'. Stop.
The Clang web page (http://clang.llvm.org/) describes how to build Clang:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html
Please note also that they aren't at the point where bug reports are useful by themselves, because they know they have not covered all of C or C++. Bug reports with patches, of course, are very useful :)

- Doug

I followed the exact instructions on that page to build, but it's obvious from the error message that it requires a Makefile.common not currently in the documented SVN repository.
Fei

On Aug 31, 2007, at 11:45 AM, Fei Liu wrote:
Doug Gregor wrote:
The Clang web page (http://clang.llvm.org/) describes how to build Clang:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html

I followed the exact instructions on that page to build, but it's obvious from the error message that it requires a Makefile.common not currently in the documented SVN repository.

Directory layout matters. The directions say to check out llvm from the LLVM repository, then go into llvm/tools and check out clang there. That way, ../../Makefile.common refers to a file in the LLVM repository.

- Doug
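The layout those directions produce looks roughly like this (an abbreviated sketch; only the relevant entries are shown):

llvm/
    Makefile.common          <- the file clang's Makefile includes
    tools/
        clang/
            Makefile         <- line 4 refers to ../../Makefile.common
            AST/  Lex/  Parse/  Sema/  ...

Checking clang out anywhere else breaks that relative path, which is exactly the error shown above.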

Fei Liu <feiliu <at> aepnetworks.com> writes:
Sounds promising, but here:

[root <at> maple clang]# make -j3
Makefile:4: ../../Makefile.common: No such file or directory
make: *** No rule to make target `../../Makefile.common'. Stop.

You need to get LLVM and run configure first. Here's a "quick start guide" which should work.

First steps: get LLVM and build it. You don't need llvm-gcc, and you don't need to do make install. I'd strongly suggest building it with objdir=srcdir.

$ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
$ cd llvm
$ ./configure
$ make

This will build LLVM in debug mode. I suggest adding llvm/Debug/bin to your path, which will give you utilities like llvm-as, llc, lli, etc. All of these have --help options, and all the LLVM utilities have pages here: http://llvm.org/docs/CommandGuide/ .

Once you have this, check out the clang front-end. Continuing the above:

$ cd tools
$ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
$ cd clang
$ make

The make command will build the llvm/Debug/bin/clang tool, which is the front-end. Once you have the clang tool, there are a bunch of different options for different things (see clang --help): you can parse and pretty print C code, run the preprocessor, etc. There are a variety of GCC compatible options.

One main feature of the front-end is that we produce excellent diagnostics, particularly for type checking errors etc. This work is still somewhat early on. In particular, our C++ support is very minimal at this point. That said, we have a great foundation, architecture, and a commitment to seeing it through.

If you have further questions or comments, please follow up on the cfe-dev list (http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev) to avoid off-topic posts here. Thanks!

-Chris

Doug Gregor wrote:
Apple has begun development of "Clang", which aims to be a high-quality C/C++ parser and compiler.

Finally! What guarantees on portability are we expecting? Apple = ppc or x86. I recall that llvm handles multiple targets so perhaps that is my answer.

Also, what exactly does open source mean here?

Sohail

On Aug 31, 2007, at 12:22 PM, Sohail Somani wrote:
Doug Gregor wrote:
Apple has begun development of "Clang", which aims to be a high-quality C/C++ parser and compiler.

Finally! What guarantees on portability are we expecting? Apple = ppc or x86. I recall that llvm handles multiple targets so perhaps that is my answer.
From the LLVM home page, "[LLVM has] back-ends for the X86, X86-64, PowerPC 32/64, ARM, Thumb, IA-64, Alpha and SPARC architectures, a back-end which emits portable C code, and a Just-In-Time compiler for X86, X86-64, PowerPC 32/64 processors." That's reasonably portable, and of course the more involved the community is in this project, the more likely that Clang will work well on all of these architectures.
Also, what exactly does open source mean here?
It's the LLVM license, which is a BSD-like license. You can check it out here: http://llvm.org/svn/llvm-project/cfe/trunk/LICENSE.TXT - Doug

on Fri Aug 31 2007, Doug Gregor <dgregor-AT-osl.iu.edu> wrote:
[snip]
Clang is still in the early stages of development. At present, it parses and type-checks most of C, with code generation for some simple constructs. C++ support is not available yet, but Apple has dedicated significant resources to making Clang a successful C++ parser and compiler.
Well, it's about time something like this existed... I always wanted a C++ compiler written in C++, with high-level abstractions, so that I could "easily" develop new features.
That said, Clang needs our help. Collectively, Boost developers have more knowledge about and a deeper understanding of C++ than any other community within C++. We have the C++ expertise to help implement the language well, and the library-design savvy to help build powerful, accessible interfaces that simplify the task of building new and improved tools for C++. The Clang team is asking for help implementing C and C++ features and is very open to new contributors.
Is there a list of tasks somewhere? If someone wants to help, where can they find something to do?

I note the LLVM website says "a GCC-based C & C++ front-end." I assume that is *not* what we're talking about here?

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

On Sat, 2007-09-01 at 10:27 -0400, David Abrahams wrote:
[snip]
Is there a list of tasks somewhere? If someone wants to help, where can they find something to do?
The best place to get these answers is on the Clang mailing list at: http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
I note the LLVM website says "a GCC-based C & C++ front-end." I assume that is *not* what we're talking about here?
No, that is not what we're talking about. LLVM can be used as a back-end for GCC's C and C++ parsers, but that involves a lot of cruft from GCC. Clang is a new, from-scratch, all-C++ parser that will be an alternative LLVM front end for C and C++. - Doug

On 09/01/07 09:51, Douglas Gregor wrote: [snip]
No, that is not what we're talking about. LLVM can be used as a back-end for GCC's C and C++ parsers, but that involves a lot of cruft from GCC. Clang is a new, from-scratch, all-C++ parser that will be an alternative LLVM front end for C and C++.

Without this cruft, I assume you're very tempted to port your variadic template compiler to clang. Is that right?

On another topic, I took a brief look at: http://llvm.org/cfe/docs/InternalsManual.html

It may be too late now, but I'd think boost's wave might have saved them some time in writing the preprocessor.

Larry Evans wrote:
[snip]
No, that is not what we're talking about. LLVM can be used as a back-end for GCC's C and C++ parsers, but that involves a lot of cruft from GCC. Clang is a new, from-scratch, all-C++ parser that will be an alternative LLVM front end for C and C++.

Without this cruft, I assume you're very tempted to port your variadic template compiler to clang. Is that right?
On another topic, I took a brief look at:
http://llvm.org/cfe/docs/InternalsManual.html
It may be too late now, but I'd think boost's wave might have saved them some time in writing the preprocessor.

Definitely. But I checked, and they did a decent job! Their preprocessor is excellent; some stuff isn't implemented yet (e.g., UCNs), but what's there is OK.

Regards
Hartmut

On Saturday 01 September 2007 13:12, Larry Evans wrote:
It may be too late now, but I'd think boost's wave might have saved them some time in writing the preprocessor.

Agreed. I've been involved with llvm for about six months now and there's a general fear of using anything from Boost, or templates in general. I'm not meaning to slam the llvm developers. What they've done is really quite good. But they have certain constraints (embedded, low memory, etc.) that make them hesitant to use more advanced C++ techniques.

It would be good if the Boost community could put together some benchmarks and general thoughts about using Boost in such environments, not only for llvm but also for embedded development in general.

-Dave

On 09/01/07 13:30, David A. Greene wrote:
On Saturday 01 September 2007 13:12, Larry Evans wrote:
It may be too late now, but I'd think boost's wave might have saved them some time in writing the preprocessor.

Agreed. I've been involved with llvm for about six months now and there's a general fear of using anything from Boost, or templates in general. I'm not meaning to slam the llvm developers. What they've done is really quite good. But they have certain constraints (embedded, low memory, etc.) that make them hesitant to use more advanced C++ techniques.

My initial response to this is "aren't they optimizing first, then correcting?" IOW, why not use templates to ease getting a correct compiler, and *then* worry about satisfying the constraints? Of course, after thinking a bit, I'd guess that's probably oversimplifying.

David's characterization is somewhat correct, but is also a bit simplistic. The LLVM project does certainly use templates (including partial specialization etc.), but we prefer to keep this as an implementation detail in a library. Exposing "complex" template code through the public interface of a library generally makes the library "scary" to those developers who don't consider themselves to be C++ gurus. This design point also reduces build time for the LLVM code itself.

-Chris
http://nondot.org/sabre
http://llvm.org

On Sep 1, 2007, at 11:54 AM, Larry Evans <cppljevans@cox-internet.com> wrote:
On 09/01/07 13:30, David A. Greene wrote:
On Saturday 01 September 2007 13:12, Larry Evans wrote:
It may be too late now, but I'd think boost's wave might have saved them some time in writing the preprocessor.

Agreed. I've been involved with llvm for about six months now and there's a general fear of using anything from Boost, or templates in general. I'm not meaning to slam the llvm developers. What they've done is really quite good. But they have certain constraints (embedded, low memory, etc.) that make them hesitant to use more advanced C++ techniques.

My initial response to this is "aren't they optimizing first, then correcting?" IOW, why not use templates to ease getting a correct compiler, and *then* worry about satisfying the constraints? Of course, after thinking a bit, I'd guess that's probably oversimplifying.

on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
David's characterization is somewhat correct, but is also a bit simplistic. The LLVM project does certainly use templates (including partial specialization etc.), but we prefer to keep this as an implementation detail in a library. Exposing "complex" template code through the public interface of a library generally makes the library "scary" to those developers who don't consider themselves to be C++ gurus. This design point also reduces build time for the LLVM code itself.

If you're building this framework as a bunch of libraries, that means expressivity at all kinds of boundaries that would be strictly "internal" in a monolithic compiler will be limited by your idea of what may scare people. Are you sure this constraint is a good idea? Seems likely there are good reasons to templatize a lexer and lots of other components in a C++ compiler.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

On Sat, 2007-09-01 at 15:52 -0400, David Abrahams wrote:
If you're building this framework as a bunch of libraries, that means expressivity at all kinds of boundaries that would be strictly "internal" in a monolithic compiler will be limited by your idea of what may scare people. Are you sure this constraint is a good idea? Seems likely there are good reasons to templatize a lexer and lots of other components in a C++ compiler.
I know that I could use a C++ parser for a reverse engineering architecture I am creating. There are many uses for a parser, and its output (e.g., an abstract syntax tree) is useful for source code reverse engineering. I would hate to have to write my own parser. That is the problem I have with GCC: the parser is hardwired into the C code. I prefer a C++ parser that can give me back an AST on which I can then perform my analysis.

Stephen

David's characterization is somewhat correct, but is also a bit simplistic. The LLVM project does certainly use templates (including partial specialization etc.), but we prefer to keep this as an implementation detail in a library. Exposing "complex" template code through the public interface of a library generally makes the library "scary" to those developers who don't consider themselves to be C++ gurus. This design point also reduces build time for the LLVM code itself.
If you're building this framework as a bunch of libraries, that means expressivity at all kinds of boundaries that would be strictly "internal" in a monolithic compiler will be limited by your idea of what may scare people. Are you sure this constraint is a good idea?
It's much easier to give concrete examples than abstract arguments. As I mentioned, LLVM does use templates heavily in some libraries. Also, note that David is talking more about the LLVM optimizer and code generator libraries than the clang front-end.
Seems likely there are good reasons to templatize a lexer and lots of other components in a C++ compiler.
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of char's, for example, and efficient buffer management (at least in our context) means that the input to the lexer isn't useful as an iterator interface.

The point of most of the LLVM design decisions is not to hide from templates: it is to use the right C++ design features in the places where they make the most sense. C++ is a very rich language and offers a wide design space to choose from. Templates are one important piece of this design space, but C++ also offers many other great things.

-Chris
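The two alternatives being weighed here look roughly like this (a sketch with hypothetical names, not clang's actual interfaces):

// Dynamic dispatch: the lexer is compiled once; the character source
// sits behind a virtual call paid per buffer refill, not per character.
class CharSource {
public:
  virtual ~CharSource() {}
  // Return the next chunk of input (and its size), or 0 at end of file.
  virtual const char *fillBuffer(unsigned &size) = 0;
};

// Static dispatch: the entire lexer is re-instantiated for every input
// type, trading compile time and code size for inlining opportunities.
template <typename InputIter>
class Lexer {
public:
  Lexer(InputIter first, InputIter last);
  // every member that touches input is stamped out once per InputIter
};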

on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of char's
You're serious? It's meaningless to lex UTF-32?
for example, and efficient buffer management (at least in our context) means that the input to the lexer isn't useful as an iterator interface.
Well, the kind of input sequence is exactly one thing I would templatize.
The point of most of the LLVM design decisions is not to hide from templates: it is to use the right C++ design features in the places where they make the most sense. C++ is a very rich language and offers a wide design space to choose from. Templates are one important piece of this design space, but C++ also offers many other great things.
Agreed.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

David Abrahams wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of char's
You're serious? It's meaningless to lex UTF-32?
Do you expect a modern compiler to include those specializations:

- lexer<ascii>
- lexer<utf8>
- lexer<utf32>
- lexer<koi8-r>
- ....
- lexer<klingon standard encoding>

all together? How much would that be, like 1G of code space just for the lexer?

- Volodya

on Sat Sep 01 2007, Vladimir Prus <ghost-AT-cs.msu.su> wrote:
Do you expect a modern compiler to include those specializations:

- lexer<ascii>
- lexer<utf8>
- lexer<utf32>
- lexer<koi8-r>
- ....
- lexer<klingon standard encoding>

all together?

No.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

On Sep 1, 2007, at 2:14 PM, David Abrahams wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of char's
You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this. Consider three possible design points:

1. Templatize the entire lexer and preprocessor on the input encoding.

2. Translate UTF-32 (and every other encoding) to UTF-8 in a prepass over the buffer. Lex the resulting buffer as UTF-8.

3. Make the low level "getchar" routine in the lexer do a dynamic switch on the input encoding type, translating the input encoding into the internal encoding on the fly.

These three design points all have strengths and weaknesses. For example:

#1 is the best solution if your compiler is known to statically care about only a single encoding - the single instantiation is obviously going to be performance optimal for any specific encoding, and the code size/compile time will be comparable to a non-templated implementation. However, if it has to handle multiple encodings (as is typical for most real compilers), it requires one instantiation of a large amount of code for a large number of encoding types. This is a serious compile time (of the lexer) and code size problem.

#2 is the best solution if your compiler almost always reads ASCII or UTF-8, but needs to be compatible with a wide range of encodings. It features low compile time (of the lexer) and is optimal for UTF-8 and compatible encodings. The bad part about it is that non-compatible encodings must have an expensive prepass over their code, which can be a significant performance problem.

#3 is the best solution for a compiler that reads a wide variety of encodings where there is no common case. It features low compile time, but obviously has worse performance than #1. One interesting aspect of it (which is often under-appreciated) is that common processors will perfectly predict the encoding-related branches in the inner scanning loop, so the performance impact would be much lower than may be expected.

My point isn't to tell you what I think is the optimal solution - I'm just trying to point out that a very important part of software engineering is choosing the right design point for a problem. I love C++ as a language because it gives me many tools to choose from to solve a specific problem such as this. Compiler design in particular doesn't have any "magic": it's all just a series of design decisions based on various engineering tradeoffs. Being aware of and carefully considering all of the options is the best way to make a high-quality product IMO.
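Design point #3 in miniature (an illustrative sketch, not clang's code; two encodings stand in for the full set):

enum Encoding { Ascii, Utf32LE };

struct Input {
  Encoding enc;
  const unsigned char *cur, *end;
};

// Return the next code point, or -1 at end of input. The switch on
// 'enc' sits in the hot path, but it predicts perfectly because the
// encoding never changes within one file.
int nextCodePoint(Input &in) {
  switch (in.enc) {
  case Ascii:
    return in.cur != in.end ? *in.cur++ : -1;
  case Utf32LE: {
    if (in.end - in.cur < 4) return -1;
    int cp = in.cur[0] | (in.cur[1] << 8) | (in.cur[2] << 16) | (in.cur[3] << 24);
    in.cur += 4;
    return cp;
  }
  }
  return -1;
}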
for example, and efficient buffer management (at least in our context) means that the input to the lexer isn't useful as an iterator interface.
Well, the kind of input sequence is exactly one thing I would templatize.
To what benefit? In practice, clang requires its input to come from a nul terminated memory buffer (yes, we do correctly handle embedded nul's in the input buffer as whitespace). Here are the pros and cons:

Pros: clang is designed for what we perceive to be the common case. In particular, mmap'ing in files almost always implicitly null terminates the buffer (if a file is not an even multiple of a page size, most major OS's null fill to the end of the page) so we get this invariant for free in most cases. Memory buffers and many others are also easy to handle in this scheme. Further, knowing that we have a sequential memory buffer as an input makes various optimizations really trivial: for example our block comment skipper is vectorized on hosts that support SSE or Altivec. Having the nul terminator at the end of the file means that the lexer doesn't have to check for "end of buffer" condition in *many* highly performance sensitive lexing loops (e.g. lexing identifiers, which cannot have a nul in them).

Cons: for files that are exactly a multiple of the system page size, sometimes you have to do a dynamic allocation and copy the buffer to nul terminate the input. Likewise, if you want to support arbitrary input iterators then you have to copy the input range into a memory buffer before you start.

Personally, I can't think of a situation where lexing C code from an arbitrary input iterator range would be useful. I'll assume you can, however: in this case, copying the input into a buffer before you start seems quite likely to give *better* performance than a fully parameterized lexer would. By restricting the lexer to only being able to assume input iterators, you force it to check for end of stream after it reads *every* character, you effectively prevent vectorization and other optimizations, and you are more at the mercy of the compiler optimizer to produce good code.

-Chris
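The nul-sentinel point is easiest to see in code (an illustrative sketch, not clang's actual lexer):

#include <cctype>

// Precondition: the buffer is nul-terminated and p points at the
// first character of an identifier. Identifiers cannot contain '\0',
// so the hot loop needs no separate end-of-buffer test: the sentinel
// stops it for free.
const char *skipIdentifier(const char *p) {
  while (std::isalnum((unsigned char)*p) || *p == '_')
    ++p;
  return p; // first character past the identifier
}
// With an arbitrary input-iterator range, the same loop would also
// have to test "p != end" on every character; that is the extra
// branch (and lost vectorization) being described above.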

on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
On Sep 1, 2007, at 2:14 PM, David Abrahams wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of char's
You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this.
Granted. It was just the first example that popped into my head. My point was simply that, if you're interested in creating a flexible toolkit out of which people can build systems with the highest efficiency, static polymorphism in its interfaces is a virtual (no pun intended) necessity.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

On 9/2/07, David Abrahams <dave@boost-consulting.com> wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
On Sep 1, 2007, at 2:14 PM, David Abrahams wrote:
on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
Can you give me a specific example of where making the entire lexer a template would be more useful than adding a couple of virtual method calls? It isn't meaningful to lex anything other than a sequence of char's
You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this.
Granted. It was just the first example that popped into my head. My point was simply that, if you're interested in creating a flexible toolkit out of which people can build systems with the highest efficiency, static polymorphism in its interfaces is a virtual (no pun intended) necessity.
Seconded. In the end, when it comes to libraries, choice wins out, and templates can achieve this with no cost to performance, size, or cleanliness. And it's not like you can't make iterators that do buffered virtual transcoding of *->UTF-8, as was discussed, if the machine code size of the primary front end is what you're after.

--
Cory Nelson

You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this.
Granted. It was just the first example that popped into my head. My point was simply that, if you're interested in creating a flexible toolkit out of which people can build systems with the highest efficiency, static polymorphism in its interfaces is a virtual (no pun intended) necessity.
I actually completely agree with you. :)

The only (potentially unpopular on this list) issue is that "highest flexibility" and "highest efficiency" are non-goals of LLVM and the clang front-end in general. Instead, we aim for "high flexibility" and "high efficiency", which sometimes means that we make tradeoffs that benefit other pragmatic goals like "reasonable compilation time", "reasonable code size", etc.

Again, this does not mean that we avoid templates, and it doesn't mean we have no templated interfaces :). It just means that we will not templatize large amounts of code to get a 0.0001% speedup or to gain flexibility in a theoretical case that no one envisions happening.

-Chris

on Sun Sep 02 2007, Chris Lattner <clattner-AT-apple.com> wrote:
You're serious? It's meaningless to lex UTF-32?
It's not meaningless, but templatizing the entire lexer (which includes the preprocessor) is not necessarily the best way to achieve this.
Granted. It was just the first example that popped into my head. My point was simply that, if you're interested in creating a flexible toolkit out of which people can build systems with the highest efficiency, static polymorphism in its interfaces is a virtual (no pun intended) necessity.
I actually completely agree with you. :)
The only (potentially unpopular on this list) issue is that "highest flexibility"
I only said "flexible."
and "highest efficiency" are non-goals of LLVM and the clang front-end in general.
Oh, I got a different impression from watching the video.
Instead, we aim for "high flexibility" and "high efficiency", which sometimes means that we make tradeoffs that benefit other pragmatic goals like "reasonable compilation time", "reasonable code size", etc.
Of course.
Again, this does not mean that we avoid templates, and it doesn't mean we have no templated interfaces :).
Oh, I got a different impression from earlier messages in this thread.
It just means that we will not templatize large amounts of code to get a 0.0001% speedup or to gain flexibility in a theoretical case that no one envisions happening.
I wouldn't do that either.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

David Abrahams wrote:
Again, this does not mean that we avoid templates, and it doesn't mean we have no templated interfaces :).
Oh, I got a different impression from earlier messages in this thread.
What Chris says is true. My message was more to the point of seeing if Boost could provide solutions for common problems in llvm. When I brought this up on the llvm list, the feedback was rather strongly negative. My goal here is not to force anything on llvm. Rather, it's to improve the Boost community's ability to clearly articulate the tradeoffs involved in using Boost. Now, in my own experience these tradeoffs have almost always been a win, and I believe there are places in llvm where they could be a win as well.

Mathias brings up reasonable questions as to the pointer-based interface. It's true that llvm has had more problems with iterators (in fact, I discovered many of them). I believe this is due to a combination of relative inexperience with the standard library but also some inconsistency in how the standard library's proponents present their case. BUT, those problems were easily found BECAUSE the iterator interface is well-encapsulated in a way that allows injection of bug-checking code.

BTW, the specific issue that keeps coming up involves taking the address of the first element of an empty std::vector:

&a[0];

The C++ community promotes std::vector as a C array replacement, but it's really not, because it's not legal to do the above operation when the vector is empty. I believe this is because the vector interface is characterized in terms of iterators rather than addresses. Thus operator[] implies operator* even though the address of the result is all that's desired.

-Dave

David Greene wrote:
:: BTW, the specific issue that keeps coming up involves taking the
:: address of the first element of an empty std::vector:
::
:: &a[0];
::
:: The C++ community promotes std::vector as a C array replacement,
:: but it's really not, because it's not legal to do the above
:: operation when the vector is empty.

That's just a misuse of the std::vector. You never, ever get the idea of taking the address of an empty array. Why attempt that on an empty vector??

:: I believe this is because the vector interface
:: is characterized in terms of iterators rather than addresses.

Yes, so a.begin() works even for an empty vector. Why is the address so interesting?

Bo Persson

Bo Persson wrote:
:: &a[0];
::
:: The C++ community promotes std::vector as a C array replacement,
:: but it's really not, because it's not legal to do the above
:: operation when the vector is empty.
That's just a misuse of the std::vector. You never, ever get the idea of taking the address of an empty array. Why attempt that on an empty vector??
Dunno what you mean by "get the idea of taking the address," but this is a perfectly reasonable thing to do in C. In llvm, there are many interfaces that want, for example:

foo(int *array, int num)

It's perfectly legal to pass (&a[0], 0) to that. It is not legal if a is a std::vector, however. That's why std::vector is not a replacement for a C array. Yet it is touted as such by C++ proponents. We need to be much more precise in our advocacy.
:: I believe this is because the vector interface
:: is characterized in terms of iterators rather than addresses.
Yes, so a.begin() works even for an empty vector. Why is the address so interesting?
Because it's the whole point of my message. The pointer interfaces do cause some problems. I'm slowly converting these to take iterators instead.

I have to respectfully disagree with Chris. The problem isn't iterators _per_se_, it's the pointer interface the iterators are used with that causes the problems. Converting these interfaces to iterators in turn drives a conversion to use templated interfaces, and this is something the llvm community has been open to, to its credit.

-Dave
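The specific pitfall, in miniature (illustrative only; foo stands in for the llvm-style pointer+size interfaces mentioned above):

#include <vector>

void foo(int *array, int num); // hypothetical pointer+size interface

void call(std::vector<int> &v) {
  // Undefined behavior when v is empty: v[0] dereferences an
  // element that does not exist.
  //   foo(&v[0], (int)v.size());

  // The usual guard, until vector grows a data() member:
  foo(v.empty() ? 0 : &v[0], (int)v.size());
}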

David A. Greene wrote:
foo(int *array, int num)
It's perfectly legal to pass (&a[0], 0) to that. It is not legal if a is a std::vector, however.
1) Declaring a stack array of zero length is illegal in the first place. (GCC needs the -pedantic compile option to actually fail to compile such code.)

int ar[0];

emptyarray.cpp: In function 'int main()':
emptyarray.cpp:7: error: ISO C++ forbids zero-size array 'ar'

2) Assuming that it were allowed, what would be the semantics? GCC gives sizeof(ar) == 0 (which is a value that does not occur in standard C++), and (int*)ar as an address on the stack, with some very weird issues: the address is the same as that of the preceding stack variable, yet comparing the two addresses seems to yield false. Good indication that what this thing is doing is ... rather weird.

3) Now, if sizeof(ar) == 0, ar cannot have a legal address - that is, not one that points into the object, because the object has no size. GCC's semantics are that it points at a different object. Thus, dereferencing the pointer is illegal. And the semantics of C++ say that &ar[0] involves dereferencing, whatever the final code does. So it is not perfectly legal to do &ar[0] on an empty array.

4) Given that zero-length arrays are not allowed, complaining about the semantics of &v[0], where v is a zero-length vector, is a strawman attack. If the array form of the code is correct, the situation cannot come up in the code that was transformed to use vectors.

5) If you really need to give it an address, pass the null pointer.

Sebastian Redl

David A. Greene wrote:
Bo Persson wrote:
:: &a[0];
::
:: The C++ community promotes std::vector as a C array replacement,
:: but it's really not, because it's not legal to do the above
:: operation when the vector is empty.
That's just a misuse of the std::vector. You never, ever get the idea of taking the address of an empty array. Why attempt that on an empty vector??
Dunno what you mean by "get the idea of taking the address," but this is a perfectly reasonable thing to do in C. In llvm, there are many interfaces that want, for example:
foo(int *array, int num)
I think the crux of the issue is why "there are many interfaces" taking raw pointers of any kind?

Jeff Flinn

I think the crux of the issue is why "there are many interfaces" taking raw pointers of any kind?
I agree, but I believe the answer is portability across different languages. There is no alternative for arrays unless you want to call that function once for each element in the array, but that could make the whole process much less efficient.

What if vector had a member function returning &v[0] when v.size() > 0 and 0 otherwise? But that's another story!

Benoit wrote:
::: I think the crux of the issue is why "there are many interfaces"
::: taking raw pointers of any kind?
::
:: I agree, but I believe the answer is portability across different
:: languages. There is no alternative for arrays unless you want to
:: call that function once for each element in the array, but that
:: could make the whole process much less efficient.

The C++ way is of course to use a start,end pair, not start+size. This works in other languages as well.

:: What if vector had a member function returning &v[0] when
:: v.size() > 0 and 0 otherwise?

It soon will! :-)

Bo Persson
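For contrast, the start,end formulation Bo describes looks like this (an illustrative sketch):

#include <vector>

// An iterator-pair interface: an empty range is simply first == last,
// so the empty-vector case needs no special handling at the call site.
template <typename Iter>
void foo(Iter first, Iter last) {
  for (; first != last; ++first) {
    // ... process *first ...
  }
}

void call(std::vector<int> &v) {
  foo(v.begin(), v.end()); // fine even when v is empty
}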

on Sun Sep 02 2007, "David A. Greene" <greened-AT-obbligato.org> wrote:
Bo Persson wrote:
:: &a[0];
::
:: The C++ community promotes std::vector as a C array replacement,
:: but it's really not, because it's not legal to do the above
:: operation when the vector is empty.
That's just a misuse of the std::vector. You never, ever get the idea of taking the address of an empty array. Why attempt that on an empty vector??
Dunno what you mean by "get the idea of taking the address," but this is a perfectly reasonable thing to do in C.
Writing x[n] is undefined behavior in C++ whenever n >= the length of the array. Is C different in that regard?

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
for example, and efficient buffer management (at least in our context) means that the input to the lexer isn't useful as an iterator interface.
Well, the kind of input sequence is exactly one thing I would templatize.
To what benefit?
So people don't have to pay the price of copying their sequence into a null-terminated memory buffer.
In practice, clang requires its input to come from a nul terminated memory buffer (yes, we do correctly handle embedded nul's in the input buffer as whitespace). Here are the pros and cons:
Pros: clang is designed for what we perceive to be the common case. In particular, mmap'ing in files almost always implicitly null terminates the buffer (if a file is not an even multiple of a page size, most major OS's null fill to the end of the page) so we get this invariant for free in most cases. Memory buffers and many others are also easy to handle in this scheme.
Further, knowing that we have a sequential memory buffer as an input makes various optimizations really trivial: for example our block comment skipper is vectorized on hosts that support SSE or Altivec. Having the nul terminator at the end of the file means that the lexer doesn't have to check for "end of buffer" condition in *many* highly performance sensitive lexing loops (e.g. lexing identifiers, which cannot have a nul in them).

The ability to provide specialized algorithm implementations that take advantage of special knowledge of the data structure is a strength of generic programming.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

Chris Lattner wrote:
David's characterization is somewhat correct, but is also a bit simplistic. The LLVM project does certainly use templates (including partial specialization etc.), but we prefer to keep this as an implementation detail in a library. Exposing "complex" template code through the public interface of a library generally makes the library "scary" to those developers who don't consider themselves to be C++ gurus. This design point also reduces build time for the LLVM code itself.

From a rough sketch, I personally feel more scared by the pointers exposed everywhere -- which are one of the greatest sources of unsafety in C++ -- the non-usage of RAII, the explicit casts and other unsafe things, the usage of downcasting everywhere, etc. I had read that LLVM was supposed to be "finely crafted C++", but it looks more like the usual object-oriented C++ from the nineties that you see everywhere and that contributed to C++'s reputation as a dangerous and unmaintainable language.

I honestly don't understand how people can be "scared" to use templates, especially compiler writers, who should be able to understand how they work and what they do. They don't add bloat unless you make them do so (especially with LLVM, which should be able to remove all unused code and duplicates without any problem). On the contrary, they allow the design of thin layers for robust type safety as well as nice reusable wrappers.

On Sep 1, 2007, at 3:23 PM, Mathias Gaunard wrote:
Chris Lattner wrote:
David's characterization is somewhat correct, but is also a bit simplistic. The LLVM project does certainly use templates (including partial specialization etc.), but we prefer to keep this as an implementation detail in a library. Exposing "complex" template code through the public interface of a library generally makes the library "scary" to those developers who don't consider themselves to be C++ gurus. This design point also reduces build time for the LLVM code itself.

From a rough sketch, I personally feel more scared by the pointers exposed everywhere -- which are one of the greatest sources of unsafety in C++ -- the non-usage of RAII, the explicit casts and other unsafe things, the usage of downcasting everywhere, etc.
This is an interesting assertion. In LLVM, we have had far more bugs and other problems due to various subtle iterator invalidation issues than due to any sort of pointer or memory lifetime problems. While they may be your personal fear, they seem to work well for a lot of code in practice.
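The kind of invalidation bug being described, in miniature (an illustrative sketch):

#include <vector>

void grow(std::vector<int> &v) {
  for (std::vector<int>::iterator i = v.begin(); i != v.end(); ++i) {
    if (*i == 0)
      v.push_back(1); // may reallocate: 'i' now dangles, and the
                      // next ++i is undefined behavior
  }
}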
I had read that LLVM was supposed to be "finely crafted C++", but it looks more like the usual object-oriented C++ from the nineties that you see everywhere and that contributed to C++'s reputation as a dangerous and unmaintainable language.
I'm sure you're aware that "the usual object-oriented C++ from the nineties" often is the best solution when you do have *dynamic* polymorphism - as you do when processing ASTs and other compiler IRs. Templates are great for solving problems involving static polymorphism, and we certainly do apply them to such problems. In any case, this hyperbole is wildly off-topic and not very useful. If you have specific suggestions on improvements that can be made to various LLVM APIs, please do bring it up on the appropriate LLVM lists - we're always very interested in improving the system.
I honestly don't understand how can people be "scared" to use templates, especially compiler writers, which should be able to understand how they work and what they do.
You need to talk to C programmers more often then. Not all compiler engineers are C++ front-end hackers after all. I'm not interested in artificially restricting the number of people who can contribute to LLVM. Using templates for the sake of using templates isn't interesting to me: using them where they add value is, and we continue to do so.
They don't add bloat unless you make them do so (especially with LLVM which should be able to remove all unused code and duplicates without any problem). On the contrary, they allow the design of thin layers for robust type safety as well as nice reusable wrappers.
Templates are a tool that can be used for good as well as for evil. When misused, templates, like any other powerful tool, can be very damaging to readability, compile time, and code size. When used appropriately, they can be quite beneficial.

In any case, this discussion isn't very appropriate to the boost list. If you want to suggest ways to improve LLVM, please do so in the appropriate place: the LLVM list. Thanks!

-Chris

on Fri Aug 31 2007, Doug Gregor <dgregor-AT-osl.iu.edu> wrote:
The goal of separating semantic analysis from parsing is a noble one, but it sounds like he may be underestimating the amount of semantic analysis that's required to parse C++. Inside templates we have the benefit of the typename and template keywords to tell us which are the types, but not inside regular code. AFAICT that means it has to do template instantiation just to tell whether foo<bar>::x is a type or not. Am I missing something?

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

On Sat, 2007-09-01 at 10:41 -0400, David Abrahams wrote:
on Fri Aug 31 2007, Doug Gregor <dgregor-AT-osl.iu.edu> wrote:
The goal of separating semantic analysis from parsing is a noble one, but it sounds like he may be underestimating the amount of semantic analysis that's required to parse C++. Inside templates we have the benefit of the typename and template keywords to tell us which are the types, but not inside regular code. AFAICT that means it has to do template instantiation just to tell whether foo<bar>::x is a type or not. Am I missing something?
No, you're technically correct. Some semantic analysis is certainly required to parse C++, so you can't completely drop semantic analysis and still parse. However, you can keep the two notions in separate modules, and the parser will certainly need to call into the semantic analysis module to figure out whether a particular name is a type, a value, a template, etc... just like a C parser needs to consult a symbol table to figure out whether a name is a typedef name or something else. What this probably means is that the "minimal" semantic analysis for C++ is a whole lot more heavyweight than the minimal semantic analysis for C. But you still get some benefit from separating out the semantics from the parser, because there are many semantic bits that you *can* ignore if you only want an (unchecked) parse tree. - Doug
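The C analogue Doug mentions, spelled out (an illustrative snippet):

typedef int T;

void f() {
  T * x;   // with the typedef in scope: declares x as "pointer to int"
  (void)x;
}
// If T named a variable instead, "T * x;" would parse as a
// multiplication expression. The token stream alone cannot
// distinguish the two; the parser must consult the symbol table.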

on Sat Sep 01 2007, Douglas Gregor <doug.gregor-AT-gmail.com> wrote:
On Sat, 2007-09-01 at 10:41 -0400, David Abrahams wrote:
on Fri Aug 31 2007, Doug Gregor <dgregor-AT-osl.iu.edu> wrote:
The goal of separating semantic analysis from parsing is a noble one, but it sounds like he may be underestimating the amount of semantic analysis that's required to parse C++. Inside templates we have the benefit of the typename and template keywords to tell us which are the types, but not inside regular code. AFAICT that means it has to do template instantiation just to tell whether foo<bar>::x is a type or not. Am I missing something?
No, you're technically correct. Some semantic analysis is certainly required to parse C++, so you can't completely drop semantic analysis and still parse.
Isn't "some" a huge understatement? I mean, c'mon, you need to do overload resolution! Just evaluate boost::detail::is_incrementable<X>::value for some X, for example.
However, you can keep the two notions in separate modules,
Sure.
and the parser will certainly need to call into the semantic analysis module to figure out whether a particular name is a type, a value, a template, etc... just like a C parser needs to consult a symbol table to figure out whether a name is a typedef name or something else.
Yeah, only more so. At one point he said of the parser, "we don't do constant folding," but clearly you need to do that to decide whether a name is a type or not:

foo<3*5>::x * y;

It seems to me that for C++ with templates, during parsing you have to do all the semantic analysis that isn't code generation -- and that's a lot.
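Both readings of that example, spelled out (an illustrative sketch):

template <int N> struct foo { typedef int x; }; // here foo<15>::x is a type
// were the member "static const int x = 0;" instead, x would be a value

void g() {
  foo<3*5>::x * y; // with the typedef: declares y as "int*";
                   // with a value: a multiplication expression
  (void)y;         // either way, the parser must fold 3*5 and look
                   // inside foo<15> just to classify the name x
}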
What this probably means is that the "minimal" semantic analysis for C++ is a whole lot more heavyweight than the minimal semantic analysis for C. But you still get some benefit from separating out the semantics from the parser, because there are many semantic bits that you *can* ignore if you only want an (unchecked) parse tree.
What, other than code generation?

It has often seemed to me that it might make sense to parse C++ nondeterministically, just to avoid some of these issues. The number of real instances of ambiguity is probably pretty small.

Anyway, I should probably take this over to the LLVM list...

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com

David Abrahams wrote:
It has often seemed to me that it might make sense to parse C++ nondeterministically, just to avoid some of these issues. The number of real instances of ambiguity is probably pretty small.
Anyway, I should probably take this over to the LLVM list...
Agreed, and some work has been done on this already: http://www.computing.surrey.ac.uk/research/dsrg/fog/, which seems to be promising. Parsing here doesn't have to deal with ambiguities (the parser understands some superset of C++); these are resolved later during semantic analysis.

Regards
Hartmut

No, you're technically correct. Some semantic analysis is certainly required to parse C++, so you can't completely drop semantic analysis and still parse.
Isn't "some" a huge understatement? I mean, c'mon, you need to do overload resolution! Just evaluate boost::detail::is_incrementable<X>::value for some X, for example.
C++ is clearly more complicated than C. The minimal amount of semantic processing for C++ will probably include scoping, namespace, class and function processing (where in C you just need to track typedefs + scoping). However, you don't need to track function bodies and a lot of other things if you don't want to.
and the parser will certainly need to call into the semantic analysis module to figure out whether a particular name is a type, a value, a template, etc... just like a C parser needs to consult a symbol table to figure out whether a name is a typedef name or something else.
Yeah, only more so. At one point he said of the parser, "we don't do constant folding," but clearly you need to do that to decide whether a name is a type or not.
foo<3*5>::x * y;
It seems to me that for C++ with templates, during parsing you have to do all the semantic analysis that isn't code generation -- and that's a lot.

No one is debating that you have to track the right things. For integer constant expressions, you can clearly track the value as you parse, regardless of whether you are building an AST or not. You do have to do minimal amounts of semantic analysis to do this. In C, the closest examples are things like "case 1+4/(someenumval):". In the AST, we actually do represent the fully expanded form (which is useful for some clients of the AST) and compute the i-c-e value on demand.

As Doug mentioned, the most important point of the design space we are in is to keep the syntax and semantics partitioned from each other. This makes it easier to understand either of the two and enforces a clear and well-defined interface boundary between the two. Having both a minimal semantics implementation and a full AST-building semantic analysis module is more useful as verification that the interfaces are correct than anything else.
What this probably means is that the "minimal" semantic analysis for C++ is a whole lot more heavyweight than the minimal semantic analysis for C. But you still get some benefit from separating out the semantics from the parser, because there are many semantic bits that you *can* ignore if you only want an (unchecked) parse tree.
What, other than code generation?
It has often seemed to me that it might make sense to parse C++ nondeterministically, just to avoid some of these issues. The number of real instances of ambiguity is probably pretty small.
Without more context I'm not sure what you mean by non-deterministic. Obviously (hopefully) all parsers are deterministic :). There are at least three different ways of parsing C++ fuzzily:

1. Use a doxygen-style "fuzzy" parser with a set of heuristics. This is needed if you want to try to parse files without processing headers, but has obvious significant limitations.

2. Parse and track the "minimal" set of semantic information to parse correctly. This gives you a correct parse tree, and reduces the amount of semantic information you need to keep around (memory use is lower than a full sema implementation), but for C++, you end up doing a lot of stuff anyway.

3. Parse a superset of the language and either resolve the ambiguity later or not. The problem with this is that it is significantly less efficient both in time and space than using semantic information to direct the parser. However, it can provide a nice separation between the parser and semantic analyzer.

-Chris

On Sat, 2007-09-01 at 13:02 -0700, Chris Lattner wrote:
No, you're technically correct. Some semantic analysis is certainly required to parse C++, so you can't completely drop semantic analysis and still parse.
Isn't "some" a huge understatement? I mean, c'mon, you need to do overload resolution! Just evaluate boost::detail::is_incrementable<X>::value for some X, for example.
C++ is clearly more complicated than C. The minimal amount of semantic processing for C++ will probably include scoping, namespace, class and function processing (where in C you just need to track typedefs + scoping). However, you don't need to track function bodies and a lot of other things if you don't want to.
As Dave noted, it also includes template instantiation and overload resolution. It's a phenomenal amount of work to write a full C++ parser, because you need nearly everything that a compiler needs. Once you have that, "minimal" semantic analysis can still be very useful. That minimal analysis still includes most of the capabilities of a compiler (yes, template instantiation and overloading have to be there to be 100% correct), but it can still avoid instantiations of function templates, instantiations of class templates without specializations, code generation, and many of the other semantic analysis tasks. So while an AST-producing C++ parser won't have much less code than a full C++ compiler, it will execute far less of that code. You need template instantiation and overload resolution, but only in very limited cases.
As Doug mentioned, the most important point of the design space we are in is to keep the syntax and semantics partitioned from each other. This makes it easier to understand either of the two and enforces a clear and well-defined interface boundary between the two. Having both a minimal semantics implementation and a full AST-building semantic analysis module is more useful as verification that the interfaces are correct than anything else.
It's also extremely useful for anyone who wants to manipulate the ASTs. The reason GCC is so darned hard to work with (aside from the crusty C code and ambiguous data structures) is that there is no separate API for manipulating the AST. The parsing is intertwined with the semantic analysis, so if you want to go through and build a new tree *without* parsing code for that tree, things can get ugly. - Doug

on Sat Sep 01 2007, Douglas Gregor <doug.gregor-AT-gmail.com> wrote:
while an AST-producing C++ parser won't have much less code than a full C++ compiler, it will execute far less of that code. You need template instantiation and overload resolution, but only in very limited cases.
Seems to me that parsing almost any non-templated code that uses a class template for anything will need template instantiation. What am I missing? -- Dave Abrahams Boost Consulting http://www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

On Sat, 2007-09-01 at 17:17 -0400, David Abrahams wrote:
on Sat Sep 01 2007, Douglas Gregor <doug.gregor-AT-gmail.com> wrote:
while an AST-producing C++ parser won't have much less code than a full C++ compiler, it will execute far less of that code. You need template instantiation and overload resolution, but only in very limited cases.
Seems to me that parsing almost any non-templated code that uses a class template for anything will need template instantiation. What am I missing?
Yes, class templates used in non-templated code will often have to be instantiated. But the member functions of those class templates don't need to be instantiated, nor do the bodies of other function templates used from non-template code. Since most templates that get instantiated are instantiated from other templates, not having to instantiate function bodies can save a lot of time. - Doug
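A minimal illustration of the point (hypothetical function name, assuming a front end that stops short of code generation):

#include <vector>

// Type-checking this call forces instantiation of the class
// std::vector<int> (its members must be known), but only the
// *declaration* of push_back is needed here; its body need not be
// instantiated until code generation.
void append_answer(std::vector<int>& v) {
    v.push_back(42);
}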

Clang has the potential to be a top-notch, open-source C++ compiler and tool platform, from which we would all benefit. Let's help Clang get there sooner!
Links: Clang: http://clang.llvm.org/ LLVM: http://llvm.org/ Steve Naroff's talk motivating Clang: - Video: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.mov - Slides: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.pdf Clang announcement: http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html
Cool... I've been wanting something like this for a while. Out of curiosity - and this probably isn't the best place to ask - I noticed one of their stated goals is to support source code engineering tasks like refactoring, etc. The problem is that compilers aren't necessarily well suited to that kind of work, since they require _correct_ and probably preprocessed source code. Is there any indication of whether this project will support partial and incomplete parsing? Andrew Sutton asutton@cs.kent.edu

Andrew Sutton wrote:
Clang has the potential to be a top-notch, open-source C++ compiler and tool platform, from which we would all benefit. Let's help Clang get there sooner!
Links: Clang: http://clang.llvm.org/ LLVM: http://llvm.org/ Steve Naroff's talk motivating Clang: - Video: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.mov - Slides: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.pdf Clang announcement: http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html
Cool... I've been wanting something like this for a while. Out of curiosity - and this probably isn't the best place to ask - I noticed one of their stated goals is to support source code engineering tasks like refactoring, etc. The problem is that compilers aren't necessarily well suited to that kind of work, since they require _correct_ and probably preprocessed source code. Is there any indication of whether this project will support partial and incomplete parsing?
Hmm, I would expect complete and buildable source before I try any refactoring -- so that I can run tests before and after and verify nothing broke. I believe Eclipse requires all files to be saved before refactoring (this is in Java) - Volodya

Cool... I've been wanting something like this for a while. Out of curiosity - and this probably isn't the best place to ask - I noticed one of their stated goals is to support source code engineering tasks like refactoring, etc. The problem is that compilers aren't necessarily well suited to that kind of work, since they require _correct_ and probably preprocessed source code. Is there any indication of whether this project will support partial and incomplete parsing?
Hmm, I would expect complete and buildable source before I try any refactoring -- so that I can run tests before and after and verify nothing broke. I believe Eclipse requires all files to be saved before refactoring (this is in Java)
Not necessarily so. Refactorings are essentially transformations on the structured text, and most can be done regardless of whether or not the code is actually in a compilable state. Others require information beyond what the compiler can give (e.g., propagating the change of a class name to files outside the current translation unit). This also implies that you don't necessarily have to run the preprocessor before trying to run a refactoring. Other code-related tasks don't even need to parse the entire program, but just need to extract the "superstructure" of some code (think UML diagram). Andrew Sutton asutton@cs.kent.edu

On Sep 1, 2007, at 1:11 PM, Andrew Sutton wrote:
Clang has the potential to be a top-notch, open-source C++ compiler and tool platform, from which we would all benefit. Let's help Clang get there sooner!
Links: Clang: http://clang.llvm.org/ LLVM: http://llvm.org/ Steve Naroff's talk motivating Clang: - Video: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.mov - Slides: http://llvm.org/devmtg/2007-05/09-Naroff-CFE.pdf Clang announcement: http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-July/009817.html
Cool... I've been wanting something like this for a while. Out of curiosity - and this probably isn't the best place to ask - I noticed one of their stated goals is to support source code engineering tasks like refactoring, etc. The problem is that compilers aren't necessarily well suited to that kind of work, since they require _correct_ and probably preprocessed source code. Is there any indication of whether this project will support partial and incomplete parsing?
I agree with Vladimir: my definition of refactoring is a behavior-preserving transformation from one valid program to another. Given this, you really do need much of a compiler, but you also want highly accurate source location information, information about macro expansions, etc., which compilers typically don't keep. It's an explicit goal for us to preserve that information, and our solutions seem to work well for us so far. Other source transformation techniques of various kinds don't need fully correct ASTs to perform their job. While we haven't done it so far, I expect our toolkit to grow support for fuzzy parsing (of the sort that doxygen applies), which can be useful to a wide variety of clients. One of the nice aspects of our design is that the decomposition of the front-end into logical pieces allows those individual pieces to be reused without requiring the whole thing to be used. Supporting different parsing techniques in the same framework allows clients to choose the parsing technique that is most appropriate for their application. In this way, using our libraries doesn't force one set of design tradeoffs onto the client. -Chris

I agree with Vladimir: my definition of refactoring is a behavior-preserving transformation from one valid program to another. Given this, you really do need much of a compiler, but you also want highly accurate source location information, information about macro expansions, etc., which compilers typically don't keep. It's an explicit goal for us to preserve that information, and our solutions seem to work well for us so far.
To start with, your definition of refactoring is incorrect - it simply preserves the external behavior of a program. I would also point out that in order to ensure that a program is correct it first has to be preprocessed and parsed. Something as simple as renaming a function - which is a well-known refactoring - requires none of that. The complexity of the refactoring determines the amount of information needed; do you actually need a fully correct AST all the time? I doubt it. Andrew Sutton asutton@cs.kent.edu

On Sep 1, 2007, at 6:32 PM, Andrew Sutton wrote:
I agree with Vladimir: my definition of refactoring is a behavior-preserving transformation from one valid program to another. Given this, you really do need much of a compiler, but you also want highly accurate source location information, information about macro expansions, etc., which compilers typically don't keep. It's an explicit goal for us to preserve that information, and our solutions seem to work well for us so far.
To start with, your definition of refactoring is incorrect - it simply preserves the external behavior of a program.
I carefully said it was "my definition" because I'm aware there are multiple different interpretations. Why do you consider your definition to be the 'right' one? :)
I would also point out that in order to ensure that a program is correct it first has to be preprocessed and parsed. Something as simple as renaming a function - which is a well-known refactoring - requires none of that.
Actually, you're wrong. C and C++ require token pasting and escaped-newline splicing to happen - and they can certainly occur in identifiers. That is, a rename that skips them will break some correct code. If you're willing to accept that, various hacks like using sed can sometimes work... be careful of scoping issues though, particularly when macros can expand into {'s :)
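Concretely, both mechanisms can hide an identifier from a purely textual tool (a minimal sketch; the names are hypothetical):

#define PASTE(a, b) a##b

void myfunc(int);

void caller() {
    // Token pasting: "myfunc" never appears as one token in the
    // source, so a textual rename misses this call entirely.
    PASTE(my, func)(0);

    // Escaped-newline splicing: translation phase 2 glues the next
    // two physical lines into the single identifier "myfunc" (the
    // continuation must start in column 0 for this to happen).
    my\
func(1);
}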
The complexity of the refactoring determines the amount of information needed; do you actually need a fully correct AST all the time? I doubt it.
Certainly, it obviously depends on the transformation. -Chris

I carefully said it was "my definition" because I'm aware there are multiple different interpretations. Why do you consider your definition to be the 'right' one? :)
The definition that commonly occurs in the refactoring literature: preserving exterior behavior (or was it external, or outward-facing? I don't remember exactly). It amounts to any transformation that preserves the successful execution of a test program.
I would also point out that in order to ensure that a program is correct it first has to be preprocessed and parsed. Something as simple as renaming a function - which is a well-known refactoring - requires none of that.
Actually, you're wrong. C and C++ require token pasting and escaped-newline splicing to happen - and they can certainly occur in identifiers. That is, a rename that skips them will break some correct code. If you're willing to accept that, various hacks like using sed can sometimes work... be careful of scoping issues though, particularly when macros can expand into {'s :)
Yeah, but you're still talking about transformations on structured text. Having a lexically correct program is a pretty far cry from having an actually correct program - which I agree is important. There are also cases where you may want to operate directly on macros without expanding them - or on header inclusions without performing the inclusion.
The complexity of the refactoring determines the amount of information needed; do you actually need a fully correct AST all the time? I doubt it.
Certainly, it obviously depends on the transformation.
Software engineering research literature is actually a good place to go to see how people are trying to deal with these problems (that is, if you can still find people trying to work with C++ - most researchers prefer Java these days). One of the conclusions from all this work is that there's a distinct difference between a compiler and what sometimes gets called a reverse engineering parser. There's a tradeoff between correctness and robustness in their ability to work with more code and under different conditions - like in an editor, or in the absence of a correct build. I guess I'm trying to say that there is a broad class of operations on source code that requires a holistic view of the text rather than a fully preprocessed and lexical view - and the ability to build a partial AST on top of it. I think it will be interesting to see if llvm/clang is capable of addressing the two different approaches to source code analysis. Andrew Sutton asutton@cs.kent.edu

On Sep 2, 2007, at 5:14 AM, Andrew Sutton wrote:
I guess I'm trying to say that there is a broad class of operations on source code that requires a holistic view of the text rather than a fully preprocessed and lexical view - and the ability to build a partial AST on top of it. I think it will be interesting to see if llvm/clang is capable of addressing the two different approaches to source code analysis.
That's interesting! This sounds like a broad subfield of program transformations that I (obviously) know almost nothing about. :) My guess is that clang will provide a significant amount of useful infrastructure, but does not directly help people interested in these things yet. If you or anyone else is interested in working on this and pushing it forward, we would welcome it. -Chris

Andrew Sutton wrote:
To start with, your definition of refactoring is incorrect - it simply preserves the external behavior of a program. I would also point out that in order to ensure that a program is correct it first has to be preprocessed and parsed. Something as simple as renaming a function - which is a well-known refactoring - requires none of that.
Hmm, I wouldn't consider renaming a function a simple transformation, really. Consider overloading. You have to do full overload resolution (one of the most complex things in C++) in order to figure out whether a name in a given function call needs to be changed or not. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...
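For example (hypothetical names), deciding which call sites a rename of the first overload may touch requires resolving every call:

void process(int);     // suppose this overload is being renamed
void process(double);  // this overload keeps its name

void client() {
    process(1);    // resolves to process(int): must be updated
    process(1.0);  // resolves to process(double): must be left alone
}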

Hmm, I wouldn't consider renaming a function a simple transformation, really. Consider overloading. You have to do full overload resolution (one of the most complex things in C++) in order to figure out whether a name in a given function call needs to be changed or not.
Renaming the function is simple. Ensuring that you're avoiding name conflicts can be more difficult. Propagating the change requires a lot more work. It also requires knowledge beyond what a compiler can give you, since changes may be required in multiple translation units. Andrew Sutton asutton@cs.kent.edu

On Sep 3, 2007, at 4:47 AM, Andrew Sutton wrote:
Hmm, I wouldn't consider renaming a function a simple transformation, really. Consider overloading. You have to do full overload resolution (one of the most complex things in C++) in order to figure out whether a name in a given function call needs to be changed or not.
Renaming the function is simple. Ensuring that you're avoiding name conflicts can be more difficult. Propagating the change requires a lot more work. It also requires knowledge beyond what a compiler can give you, since changes may be required in multiple translation units.
The standard definitions of refactoring typically do include verifying and ensuring that renames a) properly obey scope and other language rules and b) don't conflict with other names in any other scope. Refactoring does require processing multiple translation units in most cases; I'm not sure why you'd assume it doesn't. Note specifically that 'sed' does not qualify as a refactoring tool - you have to do a lot more semantic analysis of the program to do refactoring safely. -Chris

The standard definitions of refactoring typically do include verifying and ensuring that renames a) properly obey scope and other language rules and b) don't conflict with other names in any other scope. Refactoring does require processing multiple translation units in most cases; I'm not sure why you'd assume it doesn't.
From the definition, yes... but from a practical standpoint it's easier to allow the changes and let the compiler catch the errors :) I'm fairly certain that there aren't any implementations that actually make the guarantees implied in the definition (I'm 95% sure for research prototypes and 50% for industry implementations). Partially automated refactorings are better than no refactorings. I equate it to the study of deadlock detection and prevention in operating systems: it's great in theory, but it's just not practical.
Note specifically that 'sed' does not qualify as a refactoring tool - you have to do a lot more semantic analysis of the program to do refactoring safely.
True, and neither does grep qualify as a source code analysis tool, but damn it's useful for searching through text. Hopefully I'll have some time this semester to look a lot more closely at llvm/clang. I have a feeling that it would still provide a great platform for ad-hoc analysis and transformation. Andrew Sutton asutton@cs.kent.edu

Doug Gregor wrote:
Hello fellow Boosters,
Apple has begun development of "Clang", which aims to be a high- quality C/C++ parser and compiler. Clang is designed as a library for C++ tools, from IDE-supported tools like indexing, searching, and refactoring to documentation tools, static checkers, and a full C++ front end to the Low Level Virtual Machine (LLVM). LLVM is an advanced, open-source compiler infrastructure supporting all of the back-end functionality needed for a C++ compiler. Like LLVM, Clang is written in a modern C++ style, with a clean separation between the different phases of translation in C++ (lexical analysis, parsing, semantic analysis, lowering, code generation, etc.), all of which make Clang a pleasure to work with.
How cross-platform is this project? It is fair to assume that the Mac is a prime target. What about Windows, Linux...? Thanks

On Sep 2, 2007, at 9:47 AM, Jean-Christophe Roux wrote:
Doug Gregor wrote:
Hello fellow Boosters,
Apple has begun development of "Clang", which aims to be a high- quality C/C++ parser and compiler. Clang is designed as a library for C++ tools, from IDE-supported tools like indexing, searching, and
How cross-platform is this project? It is fair to assume that the Mac is a prime target. What about Windows, Linux...?
Like the LLVM optimizer and code generator, we aim for it to be widely portable, including Windows and Linux. However, most developers on clang currently use Macs, so there may be bugs on other platforms that haven't been noticed yet. For example, Hartmut recently discovered that we were opening files in ASCII mode instead of binary mode on Windows, which caused some testsuite failures. -Chris
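For reference, the usual fix is simply to open the stream in binary mode so the Windows C runtime doesn't translate line endings (an illustrative sketch, not the actual clang code):

#include <fstream>

// In text mode, Windows translates "\r\n" to "\n" and treats Ctrl-Z
// as end-of-file, which corrupts byte-exact reads; binary mode
// disables both translations.
std::ifstream in("input.c", std::ios::in | std::ios::binary);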

How cross-platform is this project? It is fair to assume that the Mac is a prime target. What about Windows, Linux...?
Well, it at least compiles on Linux :-)
participants (21)
- Andrew Sutton
- Benoit
- Bo Persson
- Chris Lattner
- Cory Nelson
- David A. Greene
- David Abrahams
- David Greene
- Doug Gregor
- Douglas Gregor
- Fei Liu
- Hartmut Kaiser
- Jean-Christophe Roux
- Jeff Flinn
- Larry Evans
- Mathias Gaunard
- Sebastian Redl
- Sohail Somani
- Stefan Seefeld
- Stephen Torri
- Vladimir Prus