From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions

In this work, we show that contemporary language models have a previously unknown skill: the capacity for electronic circuit design from high-level textual descriptions, akin to code generation. We introduce two benchmarks: Pins100, assessing model knowledge of electrical components, and Micro25, evaluating a model's capability to design common microcontroller circuits and code in the Arduino ecosystem that involve input, output, sensors, motors, protocols, and logic, with models such as GPT-4 and Claude-V1 achieving between 60% and 96% Pass@1 on generating full devices. We include six case studies of using language models as a design assistant for moderately complex devices, such as a radiation-powered random number generator, an emoji keyboard, a visible spectrometer, and several assistive devices, while offering a qualitative analysis of performance, outlining evaluation challenges, and suggesting areas of development to improve complex circuit design and practical utility. With this work, we aim to spur research at the juncture of natural language processing and electronic design.


Introduction
The realm of science fiction often presents us with captivating visions of technology's future. A case in point is the replicator from the TV series Star Trek, a machine capable of creating various physical objects, from food and medicine to functioning devices, based solely on a user's high-level description of those objects. Contemporary language models hint at the precursors to some of this capacity for design, including the ability to design novel 2D and 3D object models (Ramesh et al., 2022; Nichol et al., 2022), predict molecular structures for drug discovery (Liu et al., 2021;
[1] Python library and data: https://github.com/cognitiveailab/words2wires
[2] Companion video: https://youtu.be/PZ1rr0dDAPI

Figure 1: An example of using a language model to convert a high-level textual description of an electronic device ("Design a random number generator that uses ambient radiation level to set the random seed, and displays random numbers on a nixie tube.") into the designs for that device. Those designs are then reviewed by a domain expert, and any errors corrected, before the device is physically manufactured using rapid-prototyping techniques.
Flam-Shepherd and Aspuru-Guzik, 2023), and generate increasingly complex sections of code that power a variety of user applications (Li et al., 2023; Wang et al., 2023).
In this work, we show that language models have a previously unknown skill: the capacity to generate working electronic devices from high-level textual descriptions, effectively bridging the gap between the words of a device description and the wires of a device design. The design process for electronic devices, such as the random number generator in Figure 1, typically follows a stage-like process illustrated in Figure 2. These steps include: ideation, electronic design (including generating parts lists, electronic schematics, and code for embedded processors called microcontrollers), followed ultimately by physical implementation via manufacturing. Our focus in this work is on automating the task of transforming a high-level concept into a practical electronics schematic, complete with companion microcontroller code, a task that currently requires significant human expertise and effort. We evaluate our approach by constructing these devices either in simulators or, for six open-ended case studies, as physical devices implemented using rapid prototyping techniques such as breadboards, circuit board manufacture, laser cutting, and 3D printing.

The contributions in this work are:
1. We empirically demonstrate the novel capacity for language models to design electronic devices from high-level textual descriptions, and introduce two benchmarks to measure this capacity. The first, PINS100, measures knowledge of 100 common electronic components. The second, MICRO25, measures the ability to design 25 common electronic devices from start to finish, including the generation of bills-of-materials, pinouts, schematics, and code.
2. Our experimental evaluation shows that both GPT-4 and CLAUDE-V1 have moderate-to-strong performance on these benchmarks, with GPT-4 achieving 96% PASS@1 at generating correct schematics and functioning code on MICRO25, while CLAUDE-V1 scores 60% on schematics and 76% on code.
3. We present six real-world case studies of using language models to construct novel devices, including an emoji keyboard, a visible light spectrometer, and two assistive devices.
Alongside, we provide a qualitative assessment of the strengths of current language models for electronic device design, while also outlining the challenges in enhancing performance, automating evaluation, and increasing their practical utility.

Related Work
Arduino microcontroller ecosystem: This work focuses on circuits that are controlled by microcontrollers, which are processors intended for embedded applications. In particular, we focus on microcontrollers supported by the ARDUINO ecosystem (Banzi and Shiloh, 2022), a popular cross-platform set of C++ tools and libraries intended to promote learning and lower the barrier to entry for creating physical devices. Most of the devices described in this work target the ARDUINO UNO, a platform with more than 10 million units sold as of 2023, which uses an ATMEGA328P microcontroller with 2K of RAM, 32K of program space, 20 input/output (I/O) pins, and a speed of 16MHz. While these specifications (e.g. RAM and speed) are modest for desktop computers, they are typical of microcontrollers and embedded applications, where programs are typically run bare-metal (i.e. without an operating system).
Code Generation: Though aspects of physical design (such as circuit board layout) have been automated for decades (see Huang et al. (2021) for review), to the best of our knowledge, this is the first work to use language models for automating early-stage circuit design, converting high-level textual device descriptions into initial electronic schematics and microcontroller code. Designing code-driven electronic circuits is similar to code generation tasks (e.g. Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021; Lai et al., 2022; Nijkamp et al., 2023, inter alia), with the additional requirement that a model must jointly generate a corresponding electrical circuit schematic that is compatible with the code, such that together they allow the microcontroller to accomplish a given task.

Figure 3: An example of representing a device specification as text (in formatted JSON) such that it can be generated by a language model, for a trivial device that illuminates an LED in response to a button being pressed ("Create a device that turns on an LED when a button is pressed"). The generated code is commented (e.g. "// LED controlled by button. This code interfaces with a circuit that has a single push button and an LED. The LED turns on when the button is pressed, and turns off when the button is not pressed."). The device specification includes the bill of materials, component pinouts, schematic (represented as a netlist), and microcontroller code. The specification can then be used to create that device, either in simulation or through physical construction. The device shown here was generated by GPT-4, and edited slightly for space.
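A device specification of this kind can be sketched in a few lines of Python. The field names and component labels below (`bill_of_materials`, `schematic`, `R1`, and so on) are illustrative assumptions rather than the paper's exact schema; the netlist convention, where each entry is one wire joining two component pins, follows Figure 3.

```python
# Illustrative device specification for the button-and-LED device,
# in the spirit of Figure 3. Field names are assumptions, not the
# paper's exact JSON schema.
device_spec = {
    "bill_of_materials": [
        {"name": "uno", "part": "Arduino Uno", "notes": "microcontroller"},
        {"name": "SW1", "part": "push button"},
        {"name": "LED1", "part": "LED, red"},
        {"name": "R1", "part": "resistor, 220 ohm", "notes": "LED current limiter"},
    ],
    "pinouts": {
        "LED1": ["anode", "cathode"],
        "SW1": ["1", "2"],
    },
    # Schematic as a netlist: each entry is one wire connecting two pins.
    "schematic": [
        [{"name": "SW1", "pin": "1"}, {"name": "uno", "pin": "D3"}],
        [{"name": "SW1", "pin": "2"}, {"name": "uno", "pin": "GND"}],
        [{"name": "uno", "pin": "D13"}, {"name": "R1", "pin": "1"}],
        [{"name": "R1", "pin": "2"}, {"name": "LED1", "pin": "anode"}],
        [{"name": "LED1", "pin": "cathode"}, {"name": "uno", "pin": "GND"}],
    ],
    "code": "// LED controlled by button\nvoid setup() { /* ... */ }",
}

def components_in_schematic(spec):
    """Return the set of component names referenced by the netlist."""
    return {end["name"] for wire in spec["schematic"] for end in wire}

# A simple consistency check: every component wired in the schematic
# should also appear in the bill of materials.
bom_names = {item["name"] for item in device_spec["bill_of_materials"]}
assert components_in_schematic(device_spec) <= bom_names
```

Representing the schematic as an explicit netlist makes simple machine checks like this possible even before simulation or physical construction.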

as UML diagrams) can increase task performance (Liu et al., 2022b), and here we adapt this to the electronics context by generating device specifications (such as schematics) immediately before code generation, to condition code generation on a specific device. Similarly, generating structured representations (in the form of code) can better elicit the knowledge in a language model (Madaan et al., 2022), and here we adapt this to generating device specifications as highly-structured JSON representations. These device specifications (bills of materials, pinouts, schematics, and code) are thousands of tokens long, and their generation is enabled by the increase in model context lengths to 8K tokens, allowing models to generate hundreds of lines of code (OpenAI, 2023; Li et al., 2023) compared to earlier models with smaller generation capacity (Chen et al., 2021; Li et al., 2022; Fried et al., 2023).
Hardware Description Languages: A contemporaneous body of work describes code generation tasks for hardware description languages (such as VHDL or VERILOG) that run on specialized processors called Field-Programmable Gate Arrays (FPGAs), typically to prototype specialized CPU designs or accelerate digital signal processing. Thakur et al. (2023) introduce a benchmark of 17 simple bitwise tasks, with the most advanced tasks including bitwise addition or counting, and show the best-performing model (CODEGEN-6B) to achieve 60% PASS@10 when solving these tasks. Similarly, Blocklove et al. (2023) use GPT-4 as a Verilog coding assistant, demonstrating that it can perform well at generating 8 simple Verilog designs (such as a bitwise adder, or a 3-state finite state machine), while also providing a qualitative evaluation of using GPT-4 as an assistant to design an 8-bit accumulator-based microprocessor. In contrast, where these projects generate code that runs on FPGAs, this work (WORDS2WIRES) generates both electrical schematics and code for electrical devices that are built from, and interface with, real electrical components, such as sensors, motors, and displays, while also providing a larger-scale set of benchmarks that perform significantly more complex and real-world tasks.
Single-shot vs Collaborative: In this work, we investigate device generation in two contexts. First, we assess single-turn generation of error-free devices using the MICRO25 benchmark, a set of electronic design tasks evaluated using code metrics similar to PASS@1 (Kulal et al., 2019) that require generating a single correct solution, with strict binary measures of task success. In the second context, we explore a collaborative coding-assistant setting, where prompts can be iteratively refined, and any errors in the schematics or code corrected by the end user. This approach is akin to tools such as GITHUB COPILOT (Chen et al., 2021), which assist in generating short parts of programs. However, in our case, the model is utilized to generate an initial version of the entire project, after which the user corrects any errors before physically constructing the device.
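For reference, the PASS@k family of metrics used in the single-turn setting can be computed with the standard unbiased estimator; the sketch below is an illustrative implementation of that estimator (with k=1 and a single sample reducing to the binary Pass@1 used here), not code from the WORDS2WIRES library.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations passes,
    given that c of the n generations are correct."""
    if n - c < k:
        # Fewer incorrect generations than samples: a correct one is
        # guaranteed to be drawn.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=1 and k=1 this reduces to the binary functional/non-functional judgment used in the MICRO25 evaluation: `pass_at_k(1, 1, 1)` is 1.0 and `pass_at_k(1, 0, 1)` is 0.0.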

Experiment 1: Component Knowledge
To design electronic circuits, the designer needs knowledge of the individual electrical components that can be used to build a circuit. One of the most fundamental aspects of this knowledge is the component pinouts, or the specific function of each electrical terminal (or pin) on a component. For example, a light-emitting diode (LED) typically has two pins: an anode, where positive voltage is applied, and a cathode, where the negative terminal or ground is applied. Here we measure large language models' knowledge of component pinouts by asking them to generate pinouts for a large number of common electrical components. While some components have only a few pins (such as the two pins of an LED), components frequently have dozens and occasionally hundreds of unique pins. Connecting component pins incorrectly in a schematic will cause a device to malfunction, so this knowledge is critical to constructing working circuits.
Benchmark: We assembled a benchmark of electronic component pinouts, PINS100, containing 100 common parts frequently used in circuits found on high-traffic electronic tutorial websites such as the ARDUINO PROJECT HUB and AUTODESK TINKERCAD CIRCUITS. Components range from 2 to 40 pins, and span a large assortment of part categories including passives (e.g. resistors/capacitors), input (e.g. switches), output (e.g. LEDs, motors, relays), sensors, integrated circuits, power regulators, logic (e.g. 7400-SERIES AND and OR gates), and microcontrollers (e.g. ARDUINO, RASPBERRY PI).
Models: We evaluate on instruction-tuned models including OpenAI's CHATGPT (GPT-3.5-TURBO) and GPT-4 (OpenAI, 2023), and Anthropic's CLAUDE-V1. Model prompts are identical across models, and include a static 1-shot exemplar (a 14-pin 7400-SERIES logic integrated circuit) that provides an example of the pinout task, as well as the requested JSON output format. Additional hyperparameters, evaluation details, and the full prompt are provided in the APPENDIX.
Evaluation: We evaluate using two binary measures of accuracy analogous to the PASS@1 code-generation metric (Kulal et al., 2019). The first scoring method, strict, requires a given model to output all of a component's pins correctly to be considered correct; otherwise it is considered incorrect. The second method, permissive, requires only the function-critical pins of a component to be present to be considered correct; failing to include non-critical pins still counts as success.
Results: Model performance in the pinout generation task is shown in Table 1. Performance reflects average binary PASS@1 performance of a given model on generating accurate pinouts; for example, a score of 50% reflects that 50% of the components had completely correct pinouts. Here, GPT-4 achieves the highest strict scoring performance, generating accurate pinouts for 74% of components, while both GPT-3.5 and CLAUDE-V1 achieve similar levels of performance, generating correct pinouts for 55% and 56% of components, respectively. Permissive scoring increases performance, with the best-scoring GPT-4 model capable of generating pinouts that include the most critical pins for 86% of electrical components in the benchmark. Taken together, these results suggest that large language models have a moderate-to-strong knowledge of electrical component pinouts, a core requirement for designing functioning electronic circuits.
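The strict and permissive metrics can be sketched as set comparisons over pin names. This is an illustrative reimplementation, assuming pin-name normalization (e.g. RST vs. RESET; see Appendix B.1) has already been applied; in the paper this scoring is performed manually by a domain expert, and the treatment of spurious extra pins under permissive scoring is an assumption here, as it is not specified above.

```python
def score_pinout(generated, gold, critical):
    """Score a generated pinout against a gold pinout, in the spirit of
    the paper's two metrics. `generated` and `gold` are sets of
    (pre-normalized) pin names; `critical` is the subset of gold pins
    that are function-critical. Returns (strict, permissive) booleans."""
    # Strict: every pin must be present and correct.
    strict = generated == gold
    # Permissive: all function-critical pins present; missing
    # non-critical pins (e.g. an optional data-ready pin) still pass.
    permissive = critical <= generated
    return strict, permissive
```

For example, a generated HMC5883L pinout that omits the optional DRDY pin but includes VDD, GND, SDA, and SCL would fail strict scoring but pass permissive scoring.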

Experiment 2: Circuit Generation
How well can contemporary language models leverage their component knowledge to design simple but functioning electronic devices? In this experiment, we investigate end-to-end generation of working devices, which includes generating four core elements: (1) a bill of materials (BOM), or list of components in the device, (2) the pinouts for each component, (3) a complete electrical circuit diagram called a schematic that details how the components are to be connected, and (4) the code to be programmed onto a microcontroller, a lightweight processor that controls embedded circuits.
An example of a complete model-generated design for a trivial device that turns on an LED in response to a button being pressed is shown in Figure 3.
Benchmark: To assess a model's ability to create microcontroller-driven electronic devices, we developed a benchmark, MICRO25, that includes 25 tasks intended for the common ARDUINO microcontroller ecosystem. These tasks, shown in Table 2, span 5 core categories including: input, interface protocols, output, sensors, and logic. Each task is either tailored to test a specific fundamental competency required to build basic microcontroller-driven electronic devices common in introductory microcontroller curricula, or the integration of several competencies into larger design flows.
Representations: Models were given format prompts to export all generated elements (bill of materials, pinouts, schematics, and code) in an annotated JSON format, shown in Figure 3. The annotated format allows the model to add comments for each generated element (e.g. specifying the uses of each component, or the purpose of each connection in the schematic), analogous to chain-of-thought reasoning (Wei et al., 2022) applied to circuit generation, as well as code generation from requirements specifications (Liu et al., 2022b), where here the requirements are the schematics and other device specifications generated immediately preceding the code. Additional details of this representation format are provided in APPENDIX B.4 and C.

Models:
We evaluate on instruction-tuned models with large (8K token) context windows, including OpenAI's GPT-4 (OpenAI, 2023) and Anthropic's CLAUDE-V1. Prompts are identical across both models, and include a static minimal 1-shot example of generating each of the 4 elements of a device specification (bill of materials, pinouts, schematic, code) in the desired JSON output format. In response to specific types of errors identified during pilot studies, the prompt also includes three incomplete snippets that provide portions of two positive and one negative generation example. After initial generation, the models are given a reflection prompt containing 12 common errors (such as correctly supplying power to each component, explicitly enumerating each connection in the schematic, and having code that functions as intended), and allowed to iteratively reflect and improve output until providing a specific stop token signifying that the model has detected no further errors. The initial prompt requires 1884 tokens, and the reflection prompt requires 431 tokens. Additional model details, including evaluation details and the full prompt, are provided in the APPENDIX.
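The generate-then-reflect procedure described above can be sketched as a simple loop. The stop-token string, round limit, and `model` interface below are illustrative assumptions (any callable mapping a prompt string to a completion string, such as a wrapper around the GPT-4 or CLAUDE-V1 API, would fit); the actual prompts and stop token are given in the paper's appendix.

```python
STOP_TOKEN = "NO_FURTHER_ERRORS"  # illustrative; not the paper's literal token
MAX_ROUNDS = 5                    # illustrative safety cap on reflection rounds

def generate_device(model, task_prompt, reflection_prompt):
    """Sketch of the generate-then-reflect loop: generate an initial
    device specification, then repeatedly ask the model to check it
    against a list of common errors until it signals it is done."""
    spec = model(task_prompt)
    for _ in range(MAX_ROUNDS):
        response = model(reflection_prompt + "\n" + spec)
        if STOP_TOKEN in response:
            break  # the model reports no remaining errors
        spec = response  # otherwise, adopt the revised specification
    return spec
```

The cap on rounds guards against a model that never emits the stop token; the paper does not state how many reflection rounds were permitted, so the value here is a placeholder.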
Evaluation: Generated devices are broken down into electrical (schematic) and code components, each of which is separately evaluated using a binary PASS@1 metric (i.e. functional or non-functional).
Because the schematic subsumes the bill of materials and component pinout information, we do not evaluate the BOM or pinouts independently, only the entire schematic; we do, however, still include generating the BOM and pinouts in the prompt to facilitate chain-of-thought reasoning. Due to the challenges of automatic evaluation in this domain, evaluation was conducted manually by a domain expert through inspection, simulation, and physical circuit construction.

Results:
The results of the device generation task on the MICRO25 benchmark are shown in Table 2. Across all tasks, model-generated code ranged from a minimum of 13 lines to a maximum of 145 lines (average 38 lines per program). GPT-4 performs extremely well on the MICRO25 benchmark, correctly generating schematics and code for 96% of benchmark tasks. CLAUDE-V1 exhibits more modest performance, achieving 60% for schematic generation, and 76% for microcontroller code generation. Taken together, this shows that contemporary language models have moderate-to-excellent overall capacity for generating common electrical circuits end-to-end, from bills of materials, pinouts, and schematics, to paired microcontroller code that accomplishes the desired functionality.

Experiment 3: Open Device Generation
While we've observed that language models have the capacity to design comparatively simple devices in Experiment 2, this result is tempered by these benchmark tasks being representative of fairly common skills and capacities that are frequently taught in microcontroller-oriented curricula found in books, internet tutorials, and blog posts; as such, the MICRO25 tasks likely exist in some form in the voluminous (but closed) training data of these models. Here, we examine how well the best-performing model, GPT-4, can create comparatively more complex devices in a more realistic and qualitative setting, where it is used as a design assistant like GITHUB COPILOT (Chen et al., 2021) to create initial plans that are then vetted and corrected by a domain expert before being physically constructed. To further increase task difficulty, each of the device specifications in this experiment was explicitly crafted to be highly unusual, either using uncommon components, or combining common components in unusual ways, such that the likelihood of similar devices appearing in the closed model training data is low. The six devices are shown in Figure 4.
Methods: Initial specifications (in the form of a natural language textual description of a device) were created for all devices, and iteratively refined several times to provide clarifications in response to undesired or errorful model-generated output. After several attempts at refining device descriptions, any remaining errors were manually corrected by a domain expert, then the devices were physically manufactured. All devices were constructed by a human, and physical aspects of the design (e.g. printed circuit boards, 3D printed or laser cut components) were designed by a human. Full device description strings, model prompts, and detailed qualitative descriptions of manual corrections required by a domain expert to reach functionality are described in the APPENDIX.
Qualitative Challenges: Generation challenges can be organized into two categories: prompt-specific challenges, and hardware-specific challenges. With respect to prompts, device generation is highly sensitive to the specific prompt, and small (and seemingly helpful) changes in the prompt to address an error can cause new errors to occur in other aspects of the device that were previously generated correctly. Similarly, having many composite requirements in the project (such as adding the requirement for each key in the emoji keyboard to generate its own musical tune) generally decreases performance, and suggests that iteratively generating devices from the core requirements (such as generating a functioning emoji keyboard) through to more fine-grained details (like adding in musical tunes) may reduce the inference load at each generation step, improving generation performance. With respect to hardware, a number of pragmatic issues occur. Electronic parts regularly reach end-of-life and are no longer manufactured or easily available, yet the model frequently generates these, likely due to the abundance of examples that use these components in internet tutorials. Similarly, the model frequently uses deprecated versions of libraries, or combines the features from different versions of libraries. Finally, the model generally performs poorly at generating low-level device drivers for specific hardware (such as sensors), and favors using existing device driver libraries. When an existing library isn't available, the model will either hallucinate one, or generate a reasonable first pass at a device library that requires extensive modification to low-level details (like clock timings) to function.

Challenges and Discussion
We identify the following challenges and opportunities in developing this capacity for automated device design further:
Prompt Sensitivity: Currently, small changes in the prompt can cause large changes in the output, affecting overall performance. While this is evident from small changes in task description strings producing novel errors in Experiment 3, this phenomenon is also visible for simpler cases. For example, the pinouts for some components, such as the MLX90614 temperature sensor, are incorrect when tested independently in Experiment 1, but correct when generated as part of a full circuit in Experiment 2. This suggests that task performance is not currently robust, and may benefit from creating and fine-tuning on a task-specific dataset.

Figure 4: The six case-study devices.
Emoji USB keyboard that has keys for 9 common emojis. Pressing an emoji types its ASCII string, just as if entered on a normal keyboard. A short musical tune with similar affect to the emoji (e.g. a love song for the heart emoji) is also played.
Random number generator that uses an ambient radiation sensor to continuously update the random seed. Random numbers are generated every few seconds, and displayed on a vintage nixie tube using a high-voltage driver.
Visible light spectrometer that measures spectra using a Hamamatsu micro-spectrometer and displays the spectrum on a 128x128 pixel OLED screen. Here, the device is shown measuring the characteristic spectrum of a white LED.
Non-contact temperature sensor that displays the temperature on a strip of 8 LEDs. Higher temperatures show as red colors and illuminate more LEDs, while lower temperatures show as blue colors and illuminate fewer LEDs.
Pill alarm assistive device that has alarms for three pills. When it's time to take a pill, the alarm continuously waves a physical flag saying "Take Pill X" back and forth using a servo motor to get the user's attention, until the button is pressed.
Ultrasonic glasses assistive device that uses an ultrasonic distance sensor to measure the distance of the nearest object in front of a person with a visual impairment. The distance is converted into an audible tone of varying frequency.
Manual Evaluation: Like COPILOT, we have observed that the electronic designs generated by language models are rarely perfect and frequently have errors. Currently these have to be discovered and corrected by a human. Contemporary work in code generation aims to use reflection (e.g. Shinn et al., 2023) to iteratively run generated code in an external interpreter (like PYTHON), report any errors to the language model, then continue this process until the generated code runs error-free. The lack of electronic simulators with large libraries of simulated devices presents a significant barrier to this form of automatic evaluation in the near-term. Ultimately this may be addressed by constraining circuit generation to only parts available within a given simulator, or by a focused effort to develop more capable simulators with a larger repertoire of components.
Generating devices with common-sense knowledge: Language models contain a variety of common-sense reasoning abilities (West et al., 2022; Liu et al., 2022a), and leveraging these abilities may enable new applications. For example, in the context of assistive devices, GPT-4 is able to infer that an ultrasonic sensor can be used to create assistive glasses to aid the visually impaired with navigation. Similarly, the model can use its common-sense knowledge to design devices that contain the most common emojis, or keys for all the prime numbers up to 20, just as easily. Ultimately, electronic devices may be distributed as templates that can be semi-automatically customized to a variety of applications based on user preferences.
Quantifying time savings of automatic versus human device design: Precisely quantifying the benefits of automated coding assistants (such as GITHUB COPILOT) is challenging, and currently measured at least in part with qualitative measures (Ziegler et al., 2022). These assistants may provide large time savings when they function correctly, but likely increase debugging time when they generate problematic code, complicating measuring their precise benefit. A similar situation likely exists here for device generation, and we provide only the following anecdotal account: the best-performing GPT-4 model described in this work produces in minutes what undergraduates in our course might initially take hours to days to perform, as they learn to adapt their existing computer science skills to the electronics and microcontroller domain. As such, in the near-term, systems such as WORDS2WIRES might be viewed as productivity assistants that allow (for example) scientists with existing coding skills but minimal electronics knowledge to quickly design instrumentation (such as data loggers) or other customized devices that are relatively modest in scope and complexity, with a minimal time investment.

Conclusion
This study empirically characterizes the previously unknown potential of contemporary language models to move from words to wires, that is, to generate working electronic device designs from high-level text descriptions. Our analysis demonstrates these models have moderate-to-high proficiency in generating component-level knowledge on the PINS100 benchmark, while GPT-4 significantly outperforms CLAUDE-V1 at generating 25 fully-functional devices from the MICRO25 benchmark, reaching near-perfect performance. When used as a design assistant for generating six more complex devices, language models can generate devices that nearly meet specifications, but still require moderate correction by domain experts to function. While this novel application of language models inspires the democratization of electronic device creation, further development is currently tempered by the lack of simulators to automatically evaluate designs, and the highly manual nature of this process.

Limitations
This work has a number of limitations, including:
Device scope: The devices generated in this work are small in scope, with limited functionality, typically comprising a small number of components, fewer than 50 lines of code, and controlled by ARDUINO microcontrollers which are frequently limited to only 2K of memory. This work does not address designing moderate or complex devices such as phones, personal computers, or other devices that are orders of magnitude more complex in terms of component counts and code length. For context, being able to successfully design all the devices in the MICRO25 benchmark would be equivalent to the performance of a particularly strong undergraduate student after having taken a first course in microcontroller design at our institution.
Generation accuracy: While the simple devices in Experiment 2 can reach high generation accuracy, particularly with GPT-4, nearly all devices in the more complex open generation condition in Experiment 3 contained errors, and required correction by a domain expert. In the open generation, three of six devices (emoji keyboard, non-contact temperature sensor, ultrasonic glasses) were generated in essentially functional forms in their base conditions (i.e. before adding additional requirements, such as playing music when keys are pressed). A detailed error analysis is provided in APPENDIX D.
Physical design and manufacture: The physical manufacturing of the devices, including building circuits on prototyping breadboards using jumper wires, designing printed circuit boards, or designing physical 3D printed or laser cut enclosures, was entirely manual and completed by a human. While technologies (such as autorouting) exist to automate some of these aspects, they were not used in this work. Similarly, while language models have been shown to have some capacity for generating 3D object models (e.g. Nichol et al., 2022), that capacity has not yet developed to where it would be possible to generate enclosures or other mounting hardware required for physical device construction.
Safety: Constructing electronic devices has real dangers and potential harms, including but not limited to the risk of fire, electrical shock, and equipment damage, and should not be attempted by non-experts. The development environment is notoriously hostile to components, and even experienced electrical engineers frequently face safety challenges or accidentally destroy components. Generated devices should always be vetted by a domain expert, and not used for safety-critical applications, or applications where harmful unintended effects may be possible.
Scope of Component Knowledge: Popular electronics distributors in the US currently stock millions of different electronic components. Though many of those components belong to particular component classes that largely share pinout information (for example, DIGIKEY, a popular US-based distributor, lists approximately 1.5 million specific resistors, each with two pins), many of these, such as the 194,000 sensors currently available, do not generally share common pinouts or functions. The 100 common electronic components used in the PINS100 benchmark are representative of common components found on electronics tutorial websites that are frequently required to build basic digital microcontroller-controlled circuits, but are by no means an exhaustive set of the possible components available to construct electronic circuits.
Speeding electronic design: Just as coding assistants such as GITHUB COPILOT can increase human productivity for coding tasks (Ziegler et al., 2022), the use of a suite of electronic design assistants may similarly increase productivity in electronic device design, reducing the design process from days to hours (or minutes). Currently, language models make a variety of errors on complex devices, and these errors are not always easy to predict. As such, the utility of language models as design assistants may be tempered in the near-term by the time required to manually review every aspect of a design for accuracy. As simulators and other automated evaluation methods become available, some of this burden of manual design review will decrease.

B Additional Experiment Details B.1 PINS100 Benchmark Evaluation
Two additional considerations complicate evaluation, which are addressed here. First, the same electrical component may come in different packages, or be built by different manufacturers -- for example, the power pin (e.g. VCC) on a given component might be on pin 3 in one package, and pin 5 on a different package. As such, evaluation requires only that the pin name be correct (e.g. "VCC"), and does not require producing pin numbers, which are typically matched when choosing a particular component package during the circuit board design phase. Second, the specific names for pins are often described differently -- and frequently with only single letters. For example, a pin with reset functionality might be described variously as "RESET", "RST", or even simply "R" in different sources of text, such as official part datasheets or web tutorials, with each label being correct. As such, model-generated output is evaluated fully manually by a domain expert, who requires generated pin names to match official documentation, obvious short forms, or (in the case of large differences) alternative part naming conventions found through a web search. This alignment process mirrors the actual electronics design process, where a reference schematic for a circuit may use different pin names than an official datasheet, and these pins need to be manually aligned by a domain expert by searching through a variety of reference materials.
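The alias-matching idea can be illustrated with a minimal sketch. The actual evaluation was fully manual, and the alias table below is a hypothetical example, not the benchmark's data:

```python
# Minimal sketch of permissive pin-name matching (illustrative only; the
# actual PINS100 evaluation was performed manually by a domain expert).
# This alias table is a hypothetical example, not the benchmark's data.
PIN_ALIASES = {
    "VCC": {"VCC", "VDD", "V+", "PWR"},
    "GND": {"GND", "VSS", "G"},
    "RESET": {"RESET", "RST", "R"},
}

def normalize(name: str) -> str:
    return name.strip().upper()

def pins_match(generated: str, reference: str) -> bool:
    """True if a generated pin name matches the reference name or a known alias."""
    g, r = normalize(generated), normalize(reference)
    if g == r:
        return True
    for aliases in PIN_ALIASES.values():
        if g in aliases and r in aliases:
            return True
    return False
```

In practice the expert also consulted datasheets and web references for large naming differences, which a static alias table cannot capture.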

B.1.1 Example of a non-critical pin
The permissive scoring metric for the PINS100 benchmark allows for missing or incorrect information for non-critical pins. A non-critical pin in this context is defined as a pin that is not, strictly speaking, required for basic use of a component. An example of a non-critical pin is the DRDY (data-ready) pin on an HMC5883L magnetometer.
The magnetometer is an I2C device that nominally requires only 4 wires to function (if wired on a breadboard): SDA, SCL, VDD, and GND. The DRDY pin is marked as optional (not required) on the datasheet, as connecting the DRDY pin primarily allows faster polling rates by signifying when measurements are ready to be read by the host microcontroller.
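The permissive scoring idea can be sketched as follows; the data structures are illustrative, not the benchmark's actual format:

```python
# Sketch of the permissive PINS100 scoring idea: missing or incorrect
# non-critical pins (e.g. DRDY on the HMC5883L) do not count against a
# prediction. Data structures here are illustrative, not the benchmark format.
def permissive_score(predicted: set, gold: dict) -> float:
    """gold maps pin name -> True if the pin is critical for basic use."""
    critical = {pin for pin, is_critical in gold.items() if is_critical}
    if not critical:
        return 1.0
    found = critical & {p.upper() for p in predicted}
    return len(found) / len(critical)

# Example gold record for the HMC5883L: DRDY is marked non-critical.
gold_hmc5883l = {"VDD": True, "GND": True, "SDA": True, "SCL": True, "DRDY": False}
```

Under this metric, a prediction that omits DRDY entirely still scores perfectly, mirroring the definition of a non-critical pin above.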

B.2 MICRO25 Benchmark Evaluation
Automatic evaluation of full devices faces a number of challenges, including that many different solutions are possible for each task, and that existing simulators typically lack many of the possible components a model might generate in a solution. As such, evaluation was conducted manually by a domain expert. The schematics and code were manually inspected for functionality. When non-trivial or uncommon solutions were generated, the circuits were evaluated by constructing them in a simulator (AUTODESK TINKERCAD, shown in Figure 3) when possible, or physically building the circuits when not possible. When circuits used difficult-to-source or obsolete components, evaluation occurred through manual inspection, comparing the generated schematics and code to reference materials. For a schematic to be considered correct, it must contain all relevant components, and be wired correctly such that code could be written to accomplish the desired task. For code to be considered correct, it must correctly perform the task given the generated schematic -- or, for cases where the schematic was incorrect, be able to accomplish the task as-is were the schematic corrected.

B.3 Description of Human Evaluator
The domain expert used for evaluating this work is an author of this work, with the following qualifications: The evaluator is an award-winning science educator, and a prolific, internationally-recognized open source hardware author, with approximately 50 articles describing their open source hardware work in popular international news media such as Reuters, Forbes, and the Washington Post. The benchmarks described in this work were both authored and evaluated by the domain expert as reflective of the content of a popular full-term (4 month) undergraduate course in rapid prototyping and microcontroller design intended for computer science and information science students, typically undertaken in a student's final year of undergraduate studies at an R1 ("very high research activity") university in the United States. The domain expert has delivered this course approximately 10 times to approximately 500 undergraduate students.

B.4 Representation of Full Devices
Additional information on device representation in Figure 3: The bill of materials format expresses canonical information typical in the design process, including the component type (e.g. "resistor"), component name in the schematic (e.g. "R1"), component value (e.g. "10k ohms"), as well as a note on the purpose of the component (e.g. "current limiting resistor for LED"). The pinouts are expressed as a dictionary containing lists of pins for each part, as in Experiment 1. Code is expressed between MARKDOWN code blocks to ease extraction. Schematics are expressed as "netlists", a common storage format frequently adopted by electronic design tools. This format is analogous to an undirected graph, where edges represent a given connection from one component pin (such as the anode of an LED) to another component pin (such as one terminal of a current-limiting resistor).
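The representation described above can be illustrated with a minimal sketch. The exact JSON schema used by WORDS2WIRES may differ; the component and pin names here are examples:

```python
# Minimal illustration of the device representation: a bill of materials,
# per-part pinouts, and a netlist as an undirected graph whose edges connect
# one component pin to another. Names and schema here are illustrative.
bill_of_materials = [
    {"type": "LED", "name": "D1", "value": "red", "notes": "status indicator"},
    {"type": "resistor", "name": "R1", "value": "220 ohms",
     "notes": "current limiting resistor for LED"},
]

pinouts = {"D1": ["ANODE", "CATHODE"], "R1": ["1", "2"]}

# Each edge connects one component pin to another component pin.
netlist = [
    (("D1", "ANODE"), ("R1", "1")),
    (("R1", "2"), ("Uno", "D5")),
    (("D1", "CATHODE"), ("Uno", "GND")),
]

def connected_pins(netlist, component, pin):
    """Return all pins wired to the given component pin (undirected)."""
    result = []
    for a, b in netlist:
        if a == (component, pin):
            result.append(b)
        elif b == (component, pin):
            result.append(a)
    return result
```

Because the graph is undirected, a connection can be queried from either endpoint, which mirrors how a wire in a schematic has no inherent direction.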

B.5 Description of 6 Open-Generation Devices
Six devices were crafted to use uncommon components, or common components in uncommon ways. The six devices are: 1. Random number generator: a random number generator using two uncommon components: (a) a radiation sensor to help provide a random seed based on ambient radiation levels, and (b) a high-voltage nixie tube (or cold-cathode display), similar to a vacuum tube, to display randomly generated digits. Nixie tubes were manufactured and used in the 1950s and 1960s before light-emitting diodes became common.
2. Emoji keyboard: a USB keyboard that only contains keys for common emoji characters.
3. Spectrometer: a visible light spectrometer using the uncommon Hamamatsu micro-spectrometer, which displays the spectrum on an organic LED (OLED) display.
4. Non-contact temperature: a device that measures the temperature of an object using a common infrared-based non-contact temperature sensor, but displays the temperature in an uncommon way: as a color-changing bar graph on an LED display.
In addition, two assistive devices were explored: 5. Pill alarm: a common pill alarm that displays the current time on an LCD display. The alarm is presented in an uncommon way: by using servo motors to physically wave flags that say "take pill X" for hearing-impaired users, until they press a reset button.
6. Ultrasonic glasses: common components (an ultrasonic distance sensor and piezo buzzer) used for an uncommon purpose -- to create a pair of glasses for the visually impaired. The glasses audibly notify the user of the distance to objects in front of them using a tone whose frequency varies with distance.

C Prompts
The full prompt for Experiment 1 (component pinouts) is provided in Table 3, while the full prompt for Experiments 2 and 3 (device generation) is provided in Table 6. All prompts are static -- that is, the same format/n-shot examples shown here are also shown in every generation request -- with the exception of the task strings ({bolded} in the tables), which are substituted with task-specific strings (representing the specific user-requested device to generate) at runtime. For Experiment 1, this is limited to the component name to generate pinout information for (e.g. "DC motor"). For Experiments 2 and 3, this is limited to the target microcontroller platform (e.g. "Arduino Uno") and plain-text device description (e.g. "create a USB keyboard that only has buttons for the 9 most popular emojis on it").
Your task is to generate a description and pinout for an electronic component.
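The substitution mechanism can be sketched as below. The template text beyond the first line, and the placeholder and function names, are illustrative, not the paper's full prompts:

```python
# Sketch of the static-prompt substitution described above: the prompt
# template is fixed across all requests, and only the task-specific strings
# are substituted at runtime. Template bodies here are abbreviated examples.
PINOUT_TEMPLATE = (
    "Your task is to generate a description and pinout for an electronic "
    "component.\nComponent: {component_name}\n"
)

DEVICE_TEMPLATE = (
    "Target platform: {platform}\n"
    "Device description: {description}\n"
)

def build_pinout_prompt(component_name: str) -> str:
    """Experiment 1: only the component name varies."""
    return PINOUT_TEMPLATE.format(component_name=component_name)

def build_device_prompt(platform: str, description: str) -> str:
    """Experiments 2 and 3: only the platform and device description vary."""
    return DEVICE_TEMPLATE.format(platform=platform, description=description)
```

Keeping the prompt static aside from these slots means the n-shot examples are identical for every request, isolating the task string as the only varying input.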

D Error Analysis: Modifications to Open-Generation Devices

D.1 High-level Qualitative Challenges
Overall high-level qualitative challenges in designing the six case study devices are described briefly below; a detailed description of errors and corrections for each of the six devices is provided in APPENDIX D.2.
Sensitivity to prompt: Small and seemingly helpful changes in the device description or prompt can cause large changes in generation.For example, for the random number generator, including a reminder that the radiation sensor required a pull-up resistor appeared to cause the model to forget to include a high-voltage supply for the nixie tube.
Many requirements can create poor performance: Adding many composite requirements to a project, even when each is individually easy, can degrade performance. For example, for the emoji keyboard, adding the requirement to play a relevant musical tune when each key is pressed generally produced only scaffolds for music generation code, without actually including the melodies.
A subsequent call to GPT-4 asking it only to fill in this music scaffold was required to generate the melodies.
Device drivers: The model performs best for straightforward circuits where the coding portion of interfacing with external components (such as sensors) is abstracted into existing libraries. When writing a low-level device library is required, the model commonly either hallucinates a non-existent library, or generates a reasonable first pass at a device library that requires extensive modification to function. For example, the micro-spectrometer device driver the model generated had the essential conceptual-level components -- i.e. that data needed to be clocked out of the spectrometer and read by an analog-to-digital converter after sending a start pulse to the spectrometer -- but the generated code had incorrect clock timings, logic levels, and other fine-grained details that made it unable to function without correction.
End-of-life parts: Electronic components regularly reach the end of their life cycles, and are no longer manufactured or easily available. The model occasionally generated circuits that used unavailable parts, and had much less competency generating circuits for newer part variants, particularly those released near GPT-4's knowledge cutoff date of September 2021 (OpenAI, 2023).

D.2 Device-Specific Errors
The devices generated in the open generation experiment generally required modification to function as intended. Here we provide a list of the major design changes required to reach functionality:

D.2.1 Random number generator
Both the radiation sensor (Radiation Watch Type 5) and the nixie tubes (IN-12A) are highly uncommon components, and are likely to have limited examples available in existing documentation. Generally, across several iterations of device description prompts, either the radiation sensing circuit was correct, or the nixie tube circuit was correct, but not both. The radiation detector requires a pull-up resistor to function, and is pulled low when a high-energy particle strikes it. The nixie tube requires an external high-voltage driver, which was usually generated correctly, but when generated incorrectly it was typically powered by USB voltage (5V) instead of the required high voltage (170V). Across device descriptions, the method used in the code to set the random seed based on the radiation sensor varied -- some useful, some largely incorrect.
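One reasonable seeding scheme consistent with the task description (accumulate particle counts and fold them into the seed) can be sketched as follows. This is an illustrative sketch, not the model's generated code; `on_particle` stands in for the sensor's active-low interrupt:

```python
import random

# Illustrative sketch of one reasonable seeding scheme: accumulate particle
# counts from the (active-low) radiation sensor, mix arrival times into the
# seed, and periodically reseed. Not the model's actual generated code.
class RadiationSeeder:
    def __init__(self):
        self.count = 0
        self.entropy = 0

    def on_particle(self, timestamp_us: int):
        """Call when the sensor line is pulled low by a particle strike."""
        self.count += 1
        # Mix the arrival time into the accumulated entropy (simple mix).
        self.entropy = (self.entropy * 31 + timestamp_us) & 0xFFFFFFFF

    def reseed(self):
        random.seed(self.entropy ^ self.count)

    def roll_d6(self) -> int:
        """Roll the 6-sided electronic dice displayed on the nixie tube."""
        return random.randint(1, 6)
```

Because particle arrival times are physically random, the accumulated entropy differs between devices and runs, while the reseed-then-roll sequence itself stays deterministic for a given entropy value.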

D.2.2 Emoji keyboard
The base emoji keyboard was largely generated without issue, though the model did mix up the version of the button input library, and failed to mention the special programming requirements for the Teensy microcontroller to place it in human-interface-device (HID) mode to act as a USB keyboard. The specific emojis were chosen by the model, adding only the requirement that they must be high-frequency, and that at least one must be the heart emoji. Adding many requirements generally reduced design quality -- for example, adding the requirement that some of the emojis needed to be at least 5 ASCII characters long was not generally successful (and the model occasionally generated emojis that were unicode, which is not generally easily supported by the USB HID standard, or occasionally generated only single characters instead of full emojis). Similarly, adding the requirement for a short musical tune to play upon pressing an emoji, where the tune should have a similar affect to the emoji (e.g. a love song for the heart emoji, a happy song for the happy emoji, etc.), generally produced only harsh single tones, or the scaffold for generating the music without actual musical tones for each emoji. This scaffold was provided to GPT-4 on its own in a post-generation step, and the resultant code added to the original code.

D.2.3 Visible spectrometer
This device had two central challenges: generating a device driver for an uncommon component (the Hamamatsu C12666MA micro-spectrometer), and using a library for a common component (a display with a common controller). Using different device descriptions, the model either hallucinated non-existent libraries for interfacing with the spectrometer, or generated its own libraries that had the high-level procedure correct (e.g. sending a start pulse to the spectrometer, then continuously sending a clock pulse while reading data using an analog-to-digital converter to read out each of the 256 spectral channels) -- though the specifics of the device driver, such as timing or logic levels, were typically incorrect and needed to be manually corrected. With respect to the display, three OLED and TFT displays with common display controllers were attempted, and the most successful (the 128x128 OLED using an SSD1351 controller) was used. There were only two small errors in the display code: the initial call to the display had the order of its arguments reversed, and the last call to the display (swapping the backbuffer) was for a different library and not required here.
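The high-level readout procedure the model got right can be sketched as below. The `set_start`, `set_clock`, and `read_adc` callables are stand-ins for GPIO/ADC access, and the precise timings and logic levels (which the model got wrong) are deliberately omitted:

```python
# Illustrative sketch of the high-level C12666MA readout procedure: send a
# start pulse, then clock out each of the 256 spectral channels, sampling
# the VIDEO line with an ADC on each clock. The callables are stand-ins;
# real timings and logic levels are omitted here.
def read_spectrum(set_start, set_clock, read_adc, channels=256):
    set_start(1)      # start pulse (a real driver needs precise timing here)
    set_start(0)
    spectrum = []
    for _ in range(channels):
        set_clock(1)  # clock edge shifts out the next pixel
        spectrum.append(read_adc())
        set_clock(0)
    return spectrum
```

This is the conceptual skeleton; turning it into a working driver required correcting exactly the fine-grained details (clock timing, logic levels) that the sketch leaves out.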

D.2.4 Non-contact temperature sensor
This device consisted of two common components (an MLX90614 non-contact temperature sensor, and an 8-pixel neopixel RGB LED strip). The device generated without issue, and was only modified slightly to reverse the direction the LED bar graph displayed from (to accommodate the mounting constraints of the specific LED strip used).
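The temperature-to-bar-graph mapping from the task description (0 C or below lights one pixel, and each additional 10 C lights another) can be sketched as follows; the exact color thresholds here are illustrative:

```python
# Sketch of the temperature-to-bar-graph mapping from the task description:
# 0 C or below lights only the first neopixel, and each 10 C after lights
# another, up to the 8-pixel strip. Color thresholds here are illustrative.
def pixels_lit(temp_c: float, strip_len: int = 8) -> int:
    if temp_c <= 0:
        return 1
    return min(strip_len, 1 + int(temp_c // 10))

def pixel_color(temp_c: float) -> str:
    """blue=cold, green=mild, yellow=warm, orange=warmer, red=hot."""
    if temp_c < 10:
        return "blue"
    if temp_c < 20:
        return "green"
    if temp_c < 30:
        return "yellow"
    if temp_c < 40:
        return "orange"
    return "red"
```

The reversal mentioned above amounts to indexing the strip as `strip_len - 1 - i` rather than `i` when lighting the pixels.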

D.2.5 Pill alarm assistive device
This device consisted of three common components: a 16x2 LCD with an I2C interface, three hobby servos, and a single pushbutton. The schematics generated largely without issue. The code generally had a number of logic errors that needed correction when the added requirement of oscillating the flags back-and-forth was added, including that the code would oscillate all flags, regardless of which alarm (e.g. pill 1, pill 2, or pill 3) was active. Different generated instances of this device, in response to different specifications, either kept track of time internally or used an external real-time clock module for more accurate timekeeping -- but all generated devices failed to provide any means of setting the initial time of the device other than manually in code, an important usability feature of a clock not explicitly mentioned in the prompt.
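The corrected per-flag oscillation logic can be sketched as below (the model's code oscillated all flags regardless of which alarm was active). The data structures are illustrative:

```python
# Sketch of the corrected per-flag logic: only servos whose alarms are
# active oscillate between 45 and 90 degrees; inactive flags stay down at
# 0 degrees. Structures here are illustrative, not the generated code.
def servo_angles(active_flags, phase_up: bool):
    """active_flags: one boolean per servo; returns a target angle per servo."""
    oscillating_angle = 90 if phase_up else 45
    return [oscillating_angle if active else 0 for active in active_flags]
```

Toggling `phase_up` on a timer produces the slow 45-to-90 degree oscillation for active flags only, which is the behavior the task description called for.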

D.2.6 Ultrasonic glasses assistive device
This device generated with only small issues. The device requirements specified a specific battery-powered ESP32 microcontroller board, but the schematic used digital pin numbers that were unavailable on that board; these were trivially reassigned to other pins. The library the model used for sound generation (TONE) is notably available for most Arduino devices except the ESP32, and was modified to use an ESP32-specific function with a similar signature, plus two added initialization and termination calls. The audio frequency range generated by the model was also reduced to a range better suited to human ears, as the original included high-frequency tones that, while audible, were uncomfortable and resembled a fire alarm.
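The distance-to-tone mapping after the manual correction can be sketched as follows; the clamping bounds and frequency band below are illustrative, not the exact values used:

```python
# Sketch of the corrected distance-to-tone mapping: closer objects produce
# higher tones, with the output clamped to a comfortable audible band.
# The numeric bounds here are illustrative, not the exact values used.
def distance_to_frequency(distance_cm: float,
                          min_cm: float = 10, max_cm: float = 300,
                          low_hz: float = 200, high_hz: float = 2000) -> float:
    """Linearly interpolate: max_cm (or beyond) -> low_hz, min_cm -> high_hz."""
    d = max(min_cm, min(max_cm, distance_cm))
    fraction = (max_cm - d) / (max_cm - min_cm)
    return low_hz + fraction * (high_hz - low_hz)
```

Capping `high_hz` well below the piezo's upper range is what removed the uncomfortable fire-alarm-like tones described above.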
E Device Descriptions: 25 Benchmark Tasks and 6 Open-Generation Devices
The full device task description strings for the MICRO25 benchmark are shown in Table 4. A set of iterated task description strings for the open-generation condition (Experiment 3) are provided in Table 5. These and additional task descriptions are provided in the GITHUB repository.

Device Task Description String
Random number generator

Radioactive dice: a device that uses the radiation rate from a radiation watch type 5 sensor (which outputs a digital signal, active low, depending on whether a high-energy particle has struck it at that moment or not) to determine the random seed for an electronic dice. The device should continually read the radiation sensor, accumulate the count, and use it to help change the random seed periodically. Every 3 seconds, the device should display the roll of a 6-sided dice on a Nixie tube. It should use an IN-12 nixie tube, and K155ID1 driver.

Emoji USB keyboard

Create a keyboard that plugs in as a USB device, but instead of a full keyboard it has only a small number of buttons. The keyboard should only have buttons for 9 popular emojis, expressed as ASCII characters, not unicode. One emoji should be a heart. There should be an LED that's on all the time, but blinks off for 500 milliseconds when a button is pressed. There should also be a piezo buzzer, that plays a brief tune that is of the same affect as the emoji being pressed -- for example, a love song for the heart emoji, a happy song for a happy emoji, sad music for a sad emoji, and so forth.

Visible light spectrometer

Create a visible spectrometer that continuously displays the spectrum on an OLED display. It should use the Hamamatsu C12666MA 5V-compatible mini-spectrometer for the spectrometer (pins: 5V, GND, EOS, START, CLK, GAIN, VIDEO). The display should be a 128x128 pixel OLED with a SSD1351 controller and SPI interface, also 5V compatible (pins: GND, VIN, CD, MISO, SDCS, OLEDCS, RESET, DC, SCK, MOSI).

Non-contact temperature sensor

Create a non-contact temperature sensor using the MLX90614. The temperature should be output on an 8-pixel neopixel strip. 0 degrees or below should light only the first neopixel. For each 10C after, another neopixel should light. The color of the neopixel should change according to its temperature (blue=cold, green=mild, yellow=warm, orange=warmer, red=hot).

Pill alarm
Create a pill alarm. The alarm should have a clock that prints the current time on a 16x2 character LCD display. If the time is 6:30am, noon, or 6:30pm, the device should raise one of 3 flags (signifying different pills need to be taken). Servo 1 controls flag 1, servo 2 controls flag 2, and servo 3 controls flag 3. When raised, the servo should move from 0 degrees to 90 degrees. The servo should stay up until a button is pressed, after which it's reset to the down position (0 degrees). While raised, the servos should slowly oscillate between 45 and 90 degrees, to help get the user's attention.

Ultrasonic Glasses
Glasses for the blind that provide a helpful sound that corresponds to how close something is in front of them.Should have a slide switch that can disable the sound.Please use the MaxSonar ultrasonic distance sensor.
Table 5: Example iterated task description strings for the 6 open-generation devices in Experiment 3. A set of initial task descriptions was progressively iterated, expanded, and refined based on task performance, before arriving at the above task descriptions.

Figure 2: An overview of the electronics device design process, from concept to design implementation.
Figure 3: Example of code generation from intermediate planning representations, with schematics expressed as annotated JSON (netlists, as output by the LLM), for the task "Turn on an I/O pin when exactly 2 of 4 buttons are pressed."

Figure 4: Six devices designed using WORDS2WIRES in the open generation condition, then physically constructed.

Table 2: Model performance (PASS@1) on the MICRO25 benchmark generation tasks, broken down by schematic and code. Task descriptions are summarized for space; full task descriptions can be found in APPENDIX E.

Table 4: The full (i.e. not truncated or summarized) task strings used for the MICRO25 benchmark. The target platform for each is the ARDUINO UNO microcontroller.