Deep learning automatic front-end development: 5 seconds from sketch to HTML

As researchers continue to explore the idea, “using artificial intelligence to automatically generate web pages” has become increasingly practical. The convolutional neural network introduced in this article, SketchCode, translates hand-drawn sketches of a website’s graphical user interface directly into lines of code, taking some of the design workload off front-end developers. After training, the model reached a BLEU score of 0.76.
You can find the code for this project on GitHub:

Creating an intuitive, attractive website is an important goal for companies, and the process involves rapid cycles of prototyping, design, and user testing. Large companies like Facebook have the manpower to dedicate entire teams to the design process; changes may take several weeks and involve multiple stakeholders. Small businesses lack such resources, so their user interfaces may suffer as a result.

My goal at Insight was to use modern deep learning algorithms to dramatically streamline the design workflow, enabling companies of any size to quickly create and test web pages.

Existing design workflow

Existing workflow involves multiple stakeholders

A typical design workflow is as follows:

A product manager conducts user research to develop a specification

A designer takes these requirements and explores low-fidelity prototypes, eventually producing high-fidelity mockups

Engineers translate these designs into code and eventually deliver the product to users

The length of this development cycle quickly becomes a bottleneck, and companies like Airbnb have begun to use machine learning to make the process more efficient. (See:

Airbnb Internal AI Tool Demo: From Sketch to Code

Although this tool is a promising example of machine-aided design, it is unclear whether the model was trained fully end-to-end, or how heavily it relies on hand-crafted image features. There is no way to know, because it remains Airbnb’s proprietary, closed-source solution. I wanted to create an open-source version of “drawing-to-code” technology that would be available to a wider community of developers and designers.

Ideally, my model would take a simple hand-drawn prototype of a web design and immediately generate a working HTML website from that image:

The SketchCode model takes a drawn website wireframe and generates HTML code

In fact, the example above is an actual website generated by my model from a test-set image! You can view it on my GitHub page:

Inspiration from image captioning

The problem I am working on falls into the general task category of program synthesis, the automated generation of working source code. Although much of program synthesis deals with code generated from natural-language specifications or execution traces, in my case I could start from a source image (a hand-drawn wireframe) and automatically obtain the code I wanted.

In machine learning, there is a well-studied field called image captioning, which aims to learn models that connect images and text; in particular, such models generate a description of a source image’s contents.

An image captioning model generates a description of the source image

Inspired by a recent paper called pix2code and a related project by Emil Wallner that used this method (see: Front-end panic? Automatically generate HTML code with deep learning), I decided to recast my task as an image captioning problem: the wireframe serves as the input image and the corresponding HTML code as the output text.

Get the correct data set

Given the image captioning approach, my ideal training dataset would be thousands of pairs of hand-drawn wireframes and their HTML equivalents. Unsurprisingly, no such dataset existed, so I had to create my own data for the task.

I started with the open-source dataset used in the pix2code paper, which consists of 1,750 synthetically generated website screenshots and their corresponding source code.

Generated website images and their source code from the pix2code dataset

This dataset was a good starting point for me, with a few interesting properties:

Each generated site in the dataset contains a few simple Bootstrap elements such as buttons, text boxes, and divs. Although this means my model is limited to these elements as its “vocabulary” (the set of elements it can choose from when generating a site), the approach should generalize easily to a larger vocabulary of elements.

The source code for each example consists of domain-specific language (DSL) tokens created by the paper’s authors. Each token corresponds to a snippet of HTML and CSS, and a compiler translates the DSL into working HTML code.
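To make the DSL-to-HTML idea concrete, here is a toy compiler sketch. It is not the actual pix2code compiler, and the token names ("header", "btn-active", etc.) and HTML snippets are hypothetical stand-ins; it only illustrates the principle of mapping each DSL token to an HTML fragment, with braces marking nested containers.

```python
# Hypothetical token -> HTML mapping; the real pix2code DSL differs.
DSL_TO_HTML = {
    "header": '<div class="header">{}</div>',
    "row": '<div class="row">{}</div>',
    "btn-active": '<button class="btn btn-primary">Button</button>',
    "text": "<p>Some text</p>",
}

def compile_dsl(tokens):
    """Recursively expand DSL tokens into HTML snippets."""
    html, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if i + 1 < len(tokens) and tokens[i + 1] == "{":
            # Find the matching closing brace for this container token
            depth, j = 1, i + 2
            while depth:
                if tokens[j] == "{":
                    depth += 1
                elif tokens[j] == "}":
                    depth -= 1
                j += 1
            inner = compile_dsl(tokens[i + 2 : j - 1])
            html.append(DSL_TO_HTML[tok].format(inner))
            i = j
        else:
            html.append(DSL_TO_HTML[tok])
            i += 1
    return "".join(html)

page = compile_dsl(["header", "{", "btn-active", "}", "row", "{", "text", "}"])
```

The key design point is that the model never needs to emit raw HTML: it predicts a small, fixed token vocabulary, and the compiler handles the verbose markup.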

Making the images look hand-drawn

Converting the website’s full-color theme to a hand-drawn look

To adapt the dataset to my task, I had to make the site images look hand-drawn. Creating the hand-drawn look benefited from the grayscale conversion and contour detection functions of the OpenCV and PIL libraries.
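A minimal sketch of that kind of image transform, using only Pillow (the article mentions OpenCV as well, but a PIL-only version keeps the example self-contained). The `sketchify` helper name and the specific filter choices are my own illustration, not the article's exact pipeline:

```python
from PIL import Image, ImageFilter, ImageOps

def sketchify(img):
    """Rough sketch effect: grayscale, then edge detection, then invert
    so edges appear as dark strokes on a light background."""
    gray = ImageOps.grayscale(img)          # drop the color theme
    edges = gray.filter(ImageFilter.FIND_EDGES)  # keep element outlines
    return ImageOps.invert(edges)           # dark lines on white

# Demo on a synthetic 100x100 "screenshot": a dark box on white
demo = Image.new("RGB", (100, 100), "white")
for x in range(30, 70):
    for y in range(30, 70):
        demo.putpixel((x, y), (40, 40, 40))
result = sketchify(demo)
```

As the next paragraph explains, the final approach modified CSS directly instead, but a filter pipeline like this is one way to approximate a sketch from a rendered screenshot.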

In the end, I decided to modify the original sites’ CSS stylesheets directly, through a series of operations:

Rounding the corners of buttons and divs by changing the border-radius of page elements

Adjusting border thickness to mimic hand-drawn sketches, and adding drop shadows

Changing the fonts to a handwriting-style typeface

My final version added one more data augmentation step, applying skews, shifts, and rotations to mimic the variability of real hand-drawn sketches.

Using an image captioning model architecture

Now that I have my data ready, I can put it into the model for training!

The image captioning model I used has three main parts:

A convolutional neural network (CNN) vision model that extracts features from the source image

A language model consisting of a gated recurrent unit (GRU) that encodes the source-code token sequence

A decoder model (also a GRU) that takes the output of the previous two steps as input and predicts the next token in the sequence

Training the model with token sequences as input

To train the model, I split the source code into sequences of tokens. A single input to the model is one of these sequences together with its source image, and its label is the next token in the document. The model uses cross-entropy as its loss function, comparing the model’s predicted next token against the actual one.
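The prefix/next-token pairing described above can be sketched in a few lines. This is an illustration of the general teacher-forcing setup, not the article's exact preprocessing code; the `<START>` marker and the sample tokens are hypothetical:

```python
def make_training_pairs(tokens, start="<START>"):
    """Each training example pairs a prefix of the token sequence
    (plus the source image, omitted here) with the next token as label."""
    seq = [start] + tokens
    pairs = []
    for i in range(1, len(seq)):
        pairs.append((seq[:i], seq[i]))  # (input prefix, next-token label)
    return pairs

pairs = make_training_pairs(["header", "{", "btn-active", "}", "<END>"])
# The first pair is (["<START>"], "header"): given only the start marker
# and the image, the model must predict the document's first token.
```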

At inference time, when the model generates code from scratch, the process is slightly different. The image is still processed through the CNN, but the text branch is seeded with only a start sequence. At each step, the model’s prediction for the next token is appended to the current input sequence, which is fed back into the model as a new input. This repeats until the model predicts an <END> token or the process reaches a predefined limit on the number of tokens per document.
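That feedback loop is a standard greedy decoding pattern, which can be sketched as follows. The real CNN+GRU model is replaced here by a `predict_next` callable, so the helper names and tokens are stand-ins for illustration:

```python
def generate_tokens(predict_next, image_features, max_len=50,
                    start="<START>", end="<END>"):
    """Greedy decoding: repeatedly feed the growing sequence back into
    the model until <END> is predicted or max_len is reached.
    `predict_next(image_features, sequence)` stands in for the trained
    model and must return a single token."""
    sequence = [start]
    while len(sequence) < max_len:
        token = predict_next(image_features, sequence)
        sequence.append(token)  # predicted token becomes part of the input
        if token == end:
            break
    return sequence[1:]  # drop the <START> marker

# Stub predictor that just replays a fixed token stream
stream = iter(["header", "{", "btn-active", "}", "<END>"])
tokens = generate_tokens(lambda img, seq: next(stream), image_features=None)
```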

Once the model has generated a set of predicted tokens, the compiler converts the DSL tokens into HTML, which can be rendered in any browser.

Evaluating the model with the BLEU score

I decided to use the BLEU score to evaluate the model. This metric is commonly used in machine translation tasks; it attempts to measure how closely machine-generated text resembles the text a human would have written given the same input.

Essentially, BLEU compares the n-gram sequences of the generated text against those of the reference text, producing a modified precision score. It is well suited to this project because the score reflects both the actual elements present in the generated HTML and the relationships between them.
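A simplified single-reference BLEU can be written in a few lines: clipped n-gram precisions combined by a geometric mean, scaled by a brevity penalty. This sketch uses uniform weights up to bigrams for brevity; real evaluations typically use 4-grams with smoothing, so treat this as illustrative rather than the article's evaluation code:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (single reference, uniform weights)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

perfect = bleu(["header", "{", "btn", "}"], ["header", "{", "btn", "}"])
```

Because the "text" here is DSL tokens, matching bigrams corresponds to getting adjacent elements right relative to each other, which is why the metric fits this task.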

Best of all, I could interpret the BLEU scores by inspecting the generated websites!

BLEU score visualization

A perfect BLEU score of 1.0 means the correct elements of the source image were generated in the right places, while lower scores indicate wrong elements and/or elements placed in the wrong positions relative to one another. In the end, my model achieved a BLEU score of 0.76 on the test set.

Bonus – custom styling

An added benefit is that since the model only generates the skeleton of the page (the document’s markup), I can add a custom CSS layer during compilation and instantly see differently styled versions of the site.

One wireframe => multiple styles generated at the same time

Separating styling from the model’s generation process brings several benefits:

Front-end engineers who want to apply the SketchCode model to their own company’s products can use the model as-is, changing only a single CSS file to match their company’s style guidelines

Scalability is built in – from a single source image, the model’s output can be instantly compiled into 5, 10, or 50 different predefined styles, so users can preview multiple versions of their site side by side in a browser
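The style-swapping idea boils down to compiling the predicted markup once and pairing it with any number of stylesheets. A minimal sketch, where the template structure and stylesheet names are hypothetical:

```python
HTML_TEMPLATE = """<!doctype html>
<html><head><link rel="stylesheet" href="{css}"></head>
<body>{body}</body></html>"""

def render_variants(body_html, stylesheets):
    """Compile one predicted page body, then pair it with each of the
    predefined stylesheets to preview different visual themes."""
    return {name: HTML_TEMPLATE.format(css=name, body=body_html)
            for name in stylesheets}

pages = render_variants('<div class="header">...</div>',
                        ["default.css", "dark.css", "corporate.css"])
```

Because the model never sees the CSS, adding a fiftieth theme costs nothing at generation time; it is purely a compile-step concern.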

Summary and outlook

By leveraging the results of image captioning research, SketchCode can convert hand-drawn website wireframes into usable HTML sites in seconds.

The model has some limitations, including the following:

Since the model was trained with a vocabulary of only 16 elements, it cannot predict tokens outside its training data. A next step could be to generate additional sample sites using more elements (such as images, drop-down menus, and forms) – Bootstrap’s component list would be a good place to start:

Real production websites vary enormously. One way to build a training dataset that better reflects this variation would be to screenshot actual websites and capture their HTML/CSS code and page content.

Hand-drawn sketches also vary widely, and the CSS-modification trick only partially captures that variation. A good way to generate more sketch variation would be to use a generative adversarial network (GAN) to create realistic drawn website images.

I am looking forward to seeing the further development of the project!