I didn't say FPGAs were essentially CPUs, just that they have a lot in common.
I very strongly disagree with that.
An FPGA and a CPU are more similar than you suggest (you specify 0% similarity). How similar depends on which CPU, GPGPU, etc. you pick.
OK, pick the most similar FPGA and CPU pair out there, and describe the details that make them so similar.
They both take a program, although VHDL/Verilog have more in common with functional programming languages than with older languages like C.
An FPGA does NOT take a program. The bitstream data is essentially a netlist. A software executable is absolutely not essentially a netlist.
In Verilog:
assign mux_out = mux_select ? input_1 : input_0;
is one way to write a multiplexer gate. This is the same as connecting three wires to a mux symbol in a schematic drawing. There is no difference whatsoever, and the synthesis tool will give you a mux gate from the cell library, just like it would from a schematic. This Verilog code needs about four transistors, two input wires, a select wire, and one output wire to exist in the real world.
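To make the "it's just a gate" point concrete, here is a small sketch that checks the one-bit mux against the AND/OR/NOT equation you would draw in a schematic. C is used here only as a checking tool, not as a claim that the hardware runs code. The function names are made up for this example, and a real cell library would most likely use a transistor-level mux cell rather than discrete gates.

```c
#include <assert.h>

/* Behavioural form: the same expression as the Verilog assign, one bit wide. */
unsigned mux_behavioural(unsigned sel, unsigned in1, unsigned in0)
{
    return sel ? in1 : in0;
}

/* Gate-level form: out = (sel AND in1) OR (NOT sel AND in0). */
unsigned mux_gates(unsigned sel, unsigned in1, unsigned in0)
{
    /* This model is for single wires only, so each input must be 0 or 1. */
    assert(sel <= 1u && in1 <= 1u && in0 <= 1u);
    return (sel & in1) | ((sel ^ 1u) & in0);
}

/* Walk all 8 rows of the truth table and count rows where the forms differ. */
int mux_mismatches(void)
{
    int bad = 0;
    for (unsigned row = 0; row < 8; ++row) {
        unsigned sel = (row >> 2) & 1u;
        unsigned in1 = (row >> 1) & 1u;
        unsigned in0 = row & 1u;
        if (mux_behavioural(sel, in1, in0) != mux_gates(sel, in1, in0))
            ++bad;
    }
    return bad;
}
```

The whole behaviour fits in an eight-row truth table, which is why a schematic symbol and the assign statement describe the same thing.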
In C:
mux_out = mux_select ? input_1 : input_0;
Let's say I compiled that on my XE, which is a 32-bit CPU. mux_out, mux_select, input_1, and input_0 are each a 32-bit value. If they are all in CPU registers, that is 32 * 4 = 128 flipflops, and each flipflop is maybe a dozen transistors. Then we need a 32-bit comparator to choose which of the inputs goes into our result, and 32 2:1 muxes (each can be done in a few transistors) to bring the correct input value back to the result register. We need an opcode to tell the thing that we want to do a comparison, and opcodes to load in the input_1 and input_0 values. We need logic to fetch each instruction opcode, a decoder to determine that the opcode we fetched says to do a comparison, and a clock to drive the state machines and to latch values into the register flipflops.
Call me goofy if you like, but I do find it difficult to equate those two situations to each other.
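For a rough sense of scale, here is a back-of-the-envelope tally using only the figures quoted above ("maybe a dozen" transistors per flipflop, "about four" for the mux gate). It counts the register flipflops alone; the comparator, decoder, fetch logic, and clocking are left out because I gave no numbers for them. The function names are invented for this sketch.

```c
#include <assert.h>

/* Figures quoted in the post; both are rough estimates, not measurements. */
enum {
    TRANSISTORS_PER_FLIPFLOP = 12, /* "maybe a dozen" */
    TRANSISTORS_PER_MUX_GATE = 4   /* "about four" for the Verilog mux */
};

/* Flipflops needed to hold `values` variables of `bits` bits each. */
long register_flipflops(long values, long bits)
{
    assert(values >= 0 && bits >= 0);
    return values * bits;
}

/* Transistors spent on those register flipflops alone. */
long register_transistors(long values, long bits)
{
    return register_flipflops(values, bits) * TRANSISTORS_PER_FLIPFLOP;
}

/* How many single mux gates the same transistor budget would buy. */
long mux_gates_for_same_budget(long values, long bits)
{
    return register_transistors(values, bits) / TRANSISTORS_PER_MUX_GATE;
}
```

With four 32-bit values that comes to 128 flipflops and roughly 1536 transistors, about 384 times the single mux gate, and that is before any of the control logic is counted.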
Think http://en.wikipedia.org/wiki/Concurrent_computing with a couple of million CPUs.
That implies that there are a couple of million processing entities, each continuously running through sequences of commands telling it what to do after the program is loaded. Each processing entity is doing fetches, decodes, ALU operations, register reads and writes, etc. In an FPGA, after you load the design, the stuff that makes up the FPGA itself goes static; it is fixed until you turn it off. There are no fetches, decodes, ALUs, registers being read and written to, etc. (Reconfiguration while the system is running is a significant operation, takes significant time, and really should only be used to INFREQUENTLY swap between very different modes that do not have any dependencies on each other, and it really is a recent phenomenon. FPGAs that could do it long ago weren't used that way, or only very rarely.)
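Here is a toy sketch of what "running through sequences of commands" means for the same mux. The four-opcode machine below is completely made up for illustration and is not any real instruction set. C can only model the FPGA side, of course: wired_mux still compiles to instructions on whatever CPU runs it, whereas in an FPGA that function would be a fixed piece of routing with nothing stepping through it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy instruction set, invented for this sketch. */
enum opcode { OP_JUMP_IF_SEL, OP_MOVE_IN0, OP_MOVE_IN1, OP_HALT };

struct insn {
    enum opcode op;
    size_t target;   /* jump destination, used by OP_JUMP_IF_SEL only */
};

/* result = sel ? in1 : in0, expressed as a stored program. */
static const struct insn mux_program[] = {
    { OP_JUMP_IF_SEL, 3 },
    { OP_MOVE_IN0, 0 },
    { OP_HALT, 0 },
    { OP_MOVE_IN1, 0 },
    { OP_HALT, 0 },
};

/* CPU style: fetch, decode, execute, repeat. *steps counts instructions run. */
unsigned cpu_mux(unsigned sel, unsigned in1, unsigned in0, unsigned *steps)
{
    const size_t len = sizeof mux_program / sizeof mux_program[0];
    unsigned result = 0;
    size_t pc = 0;

    *steps = 0;
    for (;;) {
        assert(pc < len);                     /* stay inside the program */
        struct insn i = mux_program[pc++];    /* fetch */
        ++*steps;
        switch (i.op) {                       /* decode, then execute */
        case OP_JUMP_IF_SEL: if (sel) pc = i.target; break;
        case OP_MOVE_IN0:    result = in0; break;
        case OP_MOVE_IN1:    result = in1; break;
        case OP_HALT:        return result;
        }
    }
}

/* FPGA style, as a model only: a fixed function with no program counter. */
unsigned wired_mux(unsigned sel, unsigned in1, unsigned in0)
{
    return sel ? in1 : in0;
}
```

Both give the same answer, but the first needs a program store, a program counter, and a decode step for every result, and the second needs none of them.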
It's not really possible to do a 1:1 mapping of a custom chip to VHDL. You're going to end up doing some things slightly differently.
The guy that made the custom chip can, and most likely already did, as that's how you design chips these days in the first place. We use Verilog where I work, but same thing. If someone were to make a TG68 ASIC, the VHDL we already know about would be synthesized into a gate-level netlist, and the ASIC place/route tool would use silicon cells from a library and put them together to draw the die. It's pretty uncommon to design digital chips any other way than VHDL or Verilog today. Schematics are antiquated for digital, though they are still relevant for analog silicon. There are analog extensions to VHDL and Verilog though, and they'll take over at some point.
You can also license the VHDL or Verilog. That's what ARM does, for example. They license their HDL code. TI, ST, Atmel, and all the others are using the same HDL code written by ARM. They add different peripherals to it and choose the fab tech they want to build it in, to try and make their product uniquely useful for a certain market, but a Cortex-M4 is a Cortex-M4 is a Cortex-M4... AppliedMicro licenses the PowerPC core from IBM; that's IBM's VHDL code (or Verilog, whichever is the case). PA Semi did make their own new implementation of the PowerPC instruction set, and had no intention of trying to clone any existing PPC processor core. They worked from the instruction set spec, the same doc an assembly programmer would learn assembly language programming from, with a different philosophy (minimize power consumption at the target performance level, compared to IBM's really freakin hot G5 design). Is the X1000 a PowerPC computer, or a simulation or emulation of a PowerPC computer?