Generated and original content: an example of an approach for identifying their similarity in a dataset of artistic images

The rise of generative AI has led to the usage of millions of original images to generate new ones. In this context, identifying which specific original works were used to produce a particular artificial image is not only very costly in terms of computational power but also uncertain. PEReN explores an alternative approach in this prototype, based on the nearest neighbors search method, which allows for a low-cost identification of the images in the training set that are most similar to a generated image. Although imperfect, this approach provides a method for objective comparisons, serving as a basis for subsequent discussion.

Access the code repository

An applicable academic method: k-nearest neighbors search

The research field of “Training Data Attribution” explores responses to the technical issue raised. Initially focused on studying the influence of training data on generated content, new research is deepening these techniques to identify the original contents used to train generative AI models.

Two types of methods are at work:

Causal methods, such as SHAP, which specifically study the influence of a particular training data item on the training process and generation. The concept of causality implies that if the training data had not been present, the resulting output would necessarily have been different. In practice, however, this is almost impossible to demonstrate on very large datasets;
Non-causal methods, which aim to identify, for a generated content, the training data that are the closest (stylistically, in terms of content, or in more “mathematical” aspects), but where the removal of these data points would not necessarily result in a difference in the generated content. It is therefore theoretically possible for similarity to be coincidental. In this case, similarity methods such as nearest neighbors search are both simpler and faster to run than causal methods.

Causal methods offer robust explainability in a statistical and theoretical sense, but their implementation is complex and resource-intensive, even for lighter approaches such as SHAP or LIME. Designed to explain predictions of classifiers, they quickly reveal their limitations when facing the text-to-image generative models studied in this note, whose complexity and volume of training data make it difficult to precisely identify influences on the generated content.

We therefore explored a non-causal approach that is simpler, more lightweight, and can be applied in a fraction of the computation time required for generation, even on very large datasets. The proposed identification approach is based on a method relying on the proximity between original contents present in a training database and those generated by the generative AI trained on this database. For a generated image, the goal is thus to find the original images that are the most similar to it. As our tests will show, the concept of similarity between works can cover several dimensions and depends on the chosen embedding model: for example, similarity in content or similarity in style.

Similar content search, or "nearest neighbor search," relies on three steps (see Figure 1):

Vectorization, which involves transforming reference images and generated images into vector representations called embeddings. These vectors capture the semantic and visual characteristics of the images. In this work, we used the CLIP embedder;
Indexing, which organizes the embeddings extracted from reference images within an optimized database for fast nearest neighbor searches;
Search, which identifies, for a generated content, the closest elements in this index according to a similarity measure.
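The three steps above can be sketched as follows. This is a minimal illustration in pure Python, not the prototype's code: the `embed` function is a placeholder (an assumption) standing in for CLIP, the "images" are already small vectors, and the index is a flat list rather than a KD-Tree or HNSW structure.

```python
import math

def embed(image):
    """Placeholder embedder (assumption): the 'image' is already a list of
    floats and is simply L2-normalized. The prototype uses CLIP here."""
    norm = math.sqrt(sum(x * x for x in image)) or 1.0
    return [x / norm for x in image]

def build_index(reference_images):
    """Indexing: store (identifier, embedding) pairs in a flat list."""
    return [(ref_id, embed(img)) for ref_id, img in reference_images.items()]

def search(index, generated_image, k=2):
    """Search: return the k closest reference ids with their Euclidean
    distance to the generated image, nearest first."""
    query = embed(generated_image)
    scored = sorted((math.dist(query, emb), ref_id) for ref_id, emb in index)
    return [(ref_id, dist) for dist, ref_id in scored[:k]]

# Hypothetical reference works, reduced to 3-dimensional toy vectors.
references = {
    "work_a": [1.0, 0.0, 0.0],
    "work_b": [0.9, 0.1, 0.0],
    "work_c": [0.0, 1.0, 0.0],
}
index = build_index(references)
neighbors = search(index, [1.0, 0.05, 0.0], k=2)
print(neighbors)
```

With K = 2, as in Figure 1, the search returns the two reference works whose embeddings lie closest to the generated image.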

Figure 1: Illustration of the nearest neighbors search system. In this example, the search was limited to 2 nearest neighbors.

The developed prototype underwent a series of tests to characterize its performance and limitations. The experimental framework focused on text-to-image models not integrated into systems such as RAG (Retrieval-Augmented Generation), which allows the inclusion of additional original images beyond those in the training dataset. In our case, the original content database that we use is a subset of the generative model’s training dataset. To select the attributions, we fixed a number K of nearest neighbors.

Tests conducted using a reference dataset

From the attribution dataset created by Wang et al. (2023) following the process illustrated in Figure 2, we selected a subset consisting solely of copyright-free works and images generated from them, comprising 1,576 original works and 8,400 generated images.

Figure 2: Principle of the attribution dataset derived from Wang et al.

The resulting reference dataset is divided into two subsets:

“gpt”: consists of 4,200 images generated using prompts related to abstract concepts generated by ChatGPT, such as “The grandeur of the past in the style of [artist] art”.
“object”: consists of 4,200 images generated using more specific prompts naming a particular object to generate, such as “A painting of a flower in the style of [artist] art”.

In the attribution dataset, each artist corresponds to a set of original works. The synthetic images were generated by Wang et al. using a text-to-image model retrained with the works belonging to the respective artist.

An analysis of the method’s performance

Initially, two methods for searching similar content were considered:

exact search (KD-Tree method, K-Dimensional Tree);
approximate search (HNSW algorithm, Hierarchical Navigable Small Worlds).

A more in-depth comparison of the two methods is available in Appendix 1.

Given the small size of our indexing database, we could not observe any performance difference between the exact search method (KD-Tree) and the approximate search method (HNSW). Therefore, we only detail the results obtained at the overall dataset level and at the level of each generated image for the HNSW method (example of attribution in Figure 3). The number K of assigned images, set between 1 and 100 for this experiment, is fixed for all generated images.

Figure 3 : Example of attributions obtained using the HNSW method for a generated image, based on the attribution rank (from 1 to 100). It is observed that as the rank increases, the attributed images become visually more distant from the generated image. This visualization highlights the importance of varying the number of attributions K to analyze the impact on the accuracy of the method used.

To evaluate performance at the overall dataset level, we measure the percentage of generated images with at least one assigned image from the actual artist who inspired the generated image. In most cases, the method successfully attributes part of the inspiration to the correct artist, as illustrated in Figure 4.

Figure 4 : Proportion of generated images with at least one correctly attributed image (the correct artist is present at least once), as a function of the fixed number of K nearest neighbors.
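As an illustration of the dataset-level measure plotted in Figure 4, the sketch below computes the share of generated images with at least one attribution from the correct artist among the K nearest neighbors. The function name, the artist labels and the toy data are assumptions made for the example.

```python
def hit_rate_at_k(attributed_artists, true_artists, k):
    """Share of generated images whose k nearest neighbors include at least
    one work by the artist that actually inspired the image."""
    hits = sum(
        1
        for ranked, truth in zip(attributed_artists, true_artists)
        if truth in ranked[:k]
    )
    return hits / len(true_artists)

# Hypothetical data: for each generated image, the artist of each neighbor,
# nearest first, and the artist that actually inspired the image.
attributed = [
    ["monet", "renoir", "monet"],
    ["degas", "monet", "cimabue"],
    ["renoir", "renoir", "degas"],
]
truth = ["monet", "cimabue", "monet"]

rates = {k: hit_rate_at_k(attributed, truth, k) for k in (1, 2, 3)}
print(rates)
```

By construction this proportion can only grow with K, since a larger neighborhood always contains the smaller one.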

Two metrics, averaged across each dataset, evaluate the performance for the generated images (see Figure 5):

precision (rate of correct images among assigned images), for different K values. It still needs improvement: from two assigned images onward, it falls below 50%, an insufficient score, although significantly higher than random chance.
recall (rate of assigned images among correct images), for different K values. These values are low but are of the same order of magnitude as those reported by Wang et al. (2023).

Figure 5 : Average precision and recall on the set of generated images, as a function of the fixed number K of nearest neighbors.
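The two per-image metrics can be written down as follows. This is a sketch under the assumption that "correct images" are the works actually used to retrain the model for the generated image; the identifiers are hypothetical.

```python
def precision_recall(assigned, correct):
    """assigned: ids of the K works returned by the search.
    correct: ids of the works actually used for the generated image."""
    assigned, correct = set(assigned), set(correct)
    true_positives = len(assigned & correct)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall

def mean_precision_recall(pairs):
    """Average both metrics over a set of generated images."""
    scores = [precision_recall(a, c) for a, c in pairs]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n

# Two hypothetical generated images, each with K = 4 assigned works.
pairs = [
    (["w1", "w2", "w3", "w4"], ["w1", "w2", "w9", "w10", "w11"]),
    (["w5", "w6", "w7", "w8"], ["w5"]),
]
mean_p, mean_r = mean_precision_recall(pairs)
print(mean_p, mean_r)
```

The second image shows why a fixed K penalizes precision: only one work was really used, so at most one of the four assigned works can be correct.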

We also observe that the method performs better on the “gpt” dataset (vague and abstract prompts) than on the “object” dataset (specific and concrete prompts). A possible explanation for this performance difference will be proposed later. In a second experiment, we varied K according to the number of real inspirational images used for each generated image. Based on the results (see Figure 13 in Appendix 2), we reached the same conclusions.

What factors influence the performance of the selected method? Four explored hypotheses

Hypothesis 1: The higher the number of artists attributed to an image, the lower the confidence in this attribution.

In the reference dataset used, each generated image is inspired by images from a single artist. It is therefore possible that a high diversity of artists in the images attributed by the method is linked to incorrect attribution: this could indicate that the style of the generated image was poorly discerned and could be attributed to several different artists. Conversely, if the method attributes images from a single artist, this could indicate greater certainty regarding the inspirational style of the generated image. If the hypothesis is confirmed, this provides an indicator of the confidence we can have in the attribution (see illustration Figure 6).

Figure 6: Illustration of the link between the diversity of assigned artists and the success of the assignment. At the top: maximum diversity of assigned artists is associated with a poor assignment. At the bottom: minimal diversity of assigned artists is associated with a good assignment.

Statistical tests were conducted on the relationship between the diversity of artists attributed to a generated image and binary performance metrics. Diversity is calculated using Shannon entropy on the attributed artists, ranging between 0 and 1. It is equal to 1 when as many distinct artists are attributed as there are images, and 0 when only one artist is attributed.
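A diversity score with these two boundary values can be obtained by dividing the Shannon entropy of the attributed artists by the logarithm of the number of attributed images. The normalization by log(K) is our reading of the description above, so the sketch below should be taken as one plausible formulation rather than the exact computation used.

```python
import math
from collections import Counter

def artist_diversity(artists):
    """Normalized Shannon entropy of the artists attributed to one image.
    Returns 0.0 for a single artist and 1.0 when every attributed image
    comes from a different artist (normalization by log(K) is an assumption)."""
    k = len(artists)
    if k <= 1:
        return 0.0
    counts = Counter(artists)
    entropy = sum((c / k) * math.log(k / c) for c in counts.values())
    return entropy / math.log(k)

print(artist_diversity(["a", "a", "a"]))       # single artist
print(artist_diversity(["a", "b", "c"]))       # all artists distinct
print(artist_diversity(["a", "a", "b", "c"]))  # intermediate case
```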

We first performed a Mann-Whitney test comparing the distributions of attributed-artist diversity depending on whether the actual artist was present among the attributed images. The low p-value associated with this test (p = 5.4e-56) confirms the difference in distribution. We then performed the same test comparing cases where the nearest attributed work is by the original artist with cases where it is not. A difference in distribution is once again identified (p = 5.9e-87). The results of these tests show a significant link between the diversity of attributed artists and:

the presence of the actual artist among the attributed images;
assigning the work by the actual artist as the closest neighbor.

The diversity of attributed artists can therefore serve as a partial indicator of confidence in the method’s predictions. However, this finding may not be generalizable to a reference dataset composed of images generated from original works of multiple artists.
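For readers unfamiliar with the Mann-Whitney test used above, here is a minimal pure-Python sketch of the two-sided version, based on the normal approximation without tie or continuity correction. It is only meant to show the mechanics on toy data (the diversity values are invented); in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
import math

def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test, normal approximation
    (no tie or continuity correction)."""
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values and give each the average rank.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_a, p_value

# Invented diversity values: real artist present vs. absent.
present = [0.0, 0.1, 0.2, 0.2, 0.3, 0.1, 0.0, 0.25]
absent = [0.6, 0.8, 0.7, 0.9, 1.0, 0.75, 0.85, 0.65]
u_stat, p_value = mann_whitney_u(present, absent)
print(u_stat, p_value)
```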

Hypothesis 2: The lower the distance between embeddings, the more reliable the attributions.

So far, we have used the distance between embeddings to identify the nearest neighbors, but this distance is also information in itself. We could use it as an indicator of the level of confidence we can assign to an attribution. Correctly attributed images would have a low distance because they are very close, while incorrectly attributed images would have a higher distance since they are farther away semantically. A distance between the embedding of the generated image and its nearest neighbor below a certain threshold could therefore indicate high confidence in the attribution (see illustration in Figure 7).

Figure 7: Illustration of the link between distance from the nearest neighbour and successful attribution. Top: a generated image inspired by Fra Carnevale, incorrectly attributed to a work by Gustave Caillebotte. Bottom: a generated image inspired by Carracci, correctly attributed to a work by Carracci.

We performed three Mann-Whitney statistical tests on the distribution of distances between embeddings. These distributions are indeed different depending on the presence of the actual artist among the attributed images (p = 2.4e-54), the identification of a work by the actual artist as the closest neighbor (p = 2.1e-106), and the singular presence of the artist in the attributed images (p = 4.0e-42).

The results tend to show a statistically significant link between the distance to the nearest neighbor of the embedding of the generated image and:

the presence of the actual artist among the attributed images;
the identification of a work by the actual artist as the closest neighbor;
the singular and unique presence of the artist in the attributed images.

A low distance between embeddings can therefore constitute another indicator of the relevance of the method’s attributions. This distance itself depends on the chosen embedder.
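One way such a distance could be turned into a confidence indicator is to check, on labelled examples, how precise the attributions are below a candidate threshold. The sketch below illustrates the idea; the function name, the threshold values and the data are assumptions, and no threshold is prescribed by our experiments.

```python
def precision_below_threshold(distances, is_correct, threshold):
    """Among attributions whose nearest-neighbor distance is at most
    `threshold`, return (share that is correct, number kept)."""
    kept = [ok for d, ok in zip(distances, is_correct) if d <= threshold]
    if not kept:
        return None, 0
    return sum(kept) / len(kept), len(kept)

# Invented nearest-neighbor distances and whether the nearest work
# belongs to the actual artist.
distances = [0.2, 0.3, 0.5, 0.7, 0.9]
is_correct = [True, True, False, True, False]

strict = precision_below_threshold(distances, is_correct, threshold=0.4)
loose = precision_below_threshold(distances, is_correct, threshold=1.0)
print(strict, loose)
```

A stricter threshold keeps fewer attributions but, if the hypothesis holds, a larger share of them is correct.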

‍	Real artist present	Nearest neighbor from the real artist	Real artist only
Hypothesis 1: diversity of artists	p = 5.4e-56	p = 5.9e-87	N/A
Hypothesis 2: distances between embeddings	p = 2.4e-54	p = 2.1e-106	p = 4.0e-42

Table 1 : Summary table of statistical test results for two hypotheses. Here, we consider the “gpt” dataset and a number of attributed images equal to the number of inspiration images for each generated image. For each hypothesis, the performed test is the Mann-Whitney test between the distributions of two classes defined by the metric in the column. For example, the p-value associated with the diversity values of artists for images where the real artist is present versus images where the real artist is not present is 5.4e-56.

Hypothesis 3: The embedder used captures the content of an image more than its style

An image can be characterized by both its content (e.g., the objects present in the image) and its style (artistic movement, colors used, etc.). Thus, insufficient consideration of the style of images by embedders could lead to attribution errors.

It is possible that the embedder used (CLIP) primarily encapsulates the content of images while neglecting the style. Indeed, the image-text pairs in CLIP’s training dataset come from a web corpus called WebImageText, which includes 400 million image-text pairs. These texts primarily describe the visual content of each image to ensure accessibility, rather than the image’s style.
The embeddings extracted in this way would lose part of the information related to the image’s style, influencing the distance calculation results: two images with different styles but featuring the same objects might appear very similar. Thus, the attribution errors of the method could be linked to inadequate embeddings: choosing a more relevant embedder, i.e., one adapted to the priorities and attribution objectives, could improve the method at a low cost.

Specialized embedders for extracting style information, such as ALADIN (Ruta et al., 2021), are not available under licenses permitting their use. We therefore sought to verify whether attributed images were closer, in terms of content, to the generated image than to reference images. This could suggest that CLIP-based attribution favors content over style.

Given the difficulty of finding an embedder that is certain to capture only the content, we decided to generate textual descriptions that extract only the content of the images. Indeed, due to the nature of the training data described above, multimodal image-to-text models seem better suited to extracting the content of an image. To quantify the similarity between two images in terms of content, we followed these steps (see details in Appendix 3):

Generate descriptions of the images' content using a multimodal model;
Calculate embeddings based on these textual descriptions;
Compute the similarity between two images.

Figure 8 : Distribution of cosine similarity between embeddings capturing the content of attributed images and those capturing the content of real images used as inspiration. A higher similarity indicates more homogeneous content among the images. The Mann-Whitney statistical test is significant at the 0.0001 threshold. Here, we consider images attributed using the HNSW method, the “gpt” dataset, and a number of attributed images equal to the number of inspiration images for each generated image.

The results in Figure 8 show that the similarity distribution of content differs between attributed images and those that actually inspired the generated images (p = 6.6e-116). We also observe that embeddings are slightly more similar for attributed images than for reference images.

Thus, attributed images are more homogeneous in content than the real images that served as inspiration. This aligns with the hypothesis that the embedder encapsulates the content of images rather than their style (see illustration in Figure 9). However, this does not formally prove that images are attributed because they have similar content to the generated image.

Figure 9 : Illustration of the bias in embedding towards image content rather than their artistic style. On the left: the source images for the generated image, with a similar artistic style but different content. On the right: the incorrectly attributed images, with a different artistic style but similar content (boat, sea, clouds).

This greater homogeneity in content among attributed images could also explain why the proposed method performs worse on the “object” dataset than on the “gpt” dataset: the former is generated using prompts that enforce a greater variety of content (flowers, animals, landscapes, etc.) than the “gpt” dataset. An embedder more sensitive to content could explain a higher number of false attributions due to content farther from the actual inspirational images.

To strengthen these observations, using an embedder specialized in recognizing artistic styles, such as ALADIN (Ruta et al., 2021), or fine-tuning a standard embedder on artistic visuals (paintings, sculptures, etc.) based on the reference dataset could be a relevant future experiment.

Hypothesis 4: Injecting images into the index could impact attribution if they belong to the same domain as the generated image

When evaluating the robustness of an image attribution process, it is necessary to consider the possibility that the initial index may be incomplete or erroneous. In this context, we introduce “noise,” referring to the potential addition of extra images to the index: either homogeneous images (same domain, art), potentially completing an initially incomplete base, or heterogeneous images (different domain, faces – FFHQ), which could disrupt the process.

To measure this robustness, we compare the initial attribution of original works (limited to a subset of the reference index) to the attribution after introducing noise.

The results in Figure 10 show that:

The impact of adding homogeneous noise (here, the other part of the art works from the reference index is used as noise) on image attribution is significant;
The impact of adding heterogeneous noise (here, face images from the FFHQ dataset) remains limited.

Figure 10 : Average matching percentage between the initial assignment (without noise) and that obtained after introducing noise, depending on the level (10%, 25%, 50%, 75%) and the type of noise (Homogeneous/Art vs. Heterogeneous/Faces).
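The matching percentage can be understood as the share of initially attributed works that are still attributed once noise has been added to the index, averaged over generated images. The sketch below follows that reading, which is an assumption about the exact definition; the identifiers are hypothetical.

```python
def matching_percentage(initial, noisy):
    """Average share (in %) of the initially attributed works that remain
    attributed after noise images are added to the index."""
    shares = []
    for image_id, before in initial.items():
        after = set(noisy.get(image_id, []))
        shares.append(len(set(before) & after) / len(before))
    return 100 * sum(shares) / len(shares)

# Attributions for two generated images, before and after adding noise.
initial = {"gen_1": ["w1", "w2"], "gen_2": ["w3", "w4"]}
noisy = {"gen_1": ["w1", "n7"], "gen_2": ["w3", "w4"]}

print(matching_percentage(initial, noisy))
```

Here one noise image ("n7") displaces an original work for the first generated image, so half of its initial attribution is lost.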

These results highlight a strong sensitivity to the index composition. An incomplete or contaminated index risks critical omissions in identifying the original works that inspired the generated images, posing a major challenge for the reliability of the attribution system.

In conclusion, a promising prototype but one which is sensitive to several key factors

The experiments show that the proposed method, based on similarity search, is promisingly efficient, provided that the reliability of the results is assessed in light of the number of artists attributed by the algorithm, the distance between the embeddings of original and generated images, and the presence of noise in the index.

An interesting property of this approach is that the computed similarity depends on the choice of the embedder, meaning that it is possible to choose one which is most adapted to the use case, whether it is by prioritizing the semantic information of the content in an image, its style, or some other parameter.

This prototype does not always make it possible to accurately identify the exact original works that inspired each generated image. Indeed, even when the truly influential artist is identified, other artists may also be attributed, whether mistakenly or not (for example, in the case of authors from the same artistic movement with strong markers). Tolerance to this type of error is configurable (by varying the threshold and the number of attributions) and remains a field to be explored in order to obtain results that are consistent on average.

These attribution mechanisms could form the basis for considering models of artist remuneration distribution based on their estimated contribution to the generation of a work. Many fields could be explored regarding the remuneration methods that would result from this attribution (see an example in Figure 11). Two approaches can be mentioned:

A weighted attribution based on similarity values: the closest images receive a weight proportional to their similarity with the generated image. The closer an image is, the higher its contribution.
A discrete attribution with fixed shares: predefined shares are allocated to the closest images according to a similarity order. For example, 90% for the first, 10% for the second.
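The two approaches can be illustrated with a short sketch. The similarity values, function names and the 90%/10% split are used purely as examples and do not correspond to a recommended remuneration scheme.

```python
def weighted_shares(similarities):
    """Shares proportional to each work's similarity with the generated image."""
    total = sum(similarities.values())
    return {work: s / total for work, s in similarities.items()}

def fixed_shares(similarities, shares=(0.9, 0.1)):
    """Predefined shares allocated to the closest works, in similarity order.
    Works ranked beyond len(shares) receive nothing."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return dict(zip(ranked, shares))

# Invented similarity scores between a generated image and three works.
similarities = {"work_a": 0.8, "work_b": 0.6, "work_c": 0.2}

weighted = weighted_shares(similarities)
fixed = fixed_shares(similarities)
print(weighted)
print(fixed)
```

The weighted variant spreads the contribution over all retrieved works, whereas the fixed variant ignores how close the scores are and depends only on the ranking.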

Figure 11 : Illustration of the attribution of source images for an image generation. At the top, the two Cimabue paintings (“Virgin Enthroned with Angels” and “Madonna Enthroned”) used for generation according to the algorithm, each in equal parts, annotated “50%”. At the bottom, the two paintings actually used for generation, still indicated at 50% each, which turn out to be the same as those suggested by the algorithm, thus illustrating a successful example of attribution based on the similarity of embeddings.

Although the development of this prototype has focused on image-generative models, similar methods exist for audio. Regarding textual content generation, other techniques such as TracIN or DataInf are applicable, but the discrete nature of text generation can make identifying the proportion of training data that influenced the generation more complex.

These prototyping efforts, intended for exploratory and technical purposes, do not presume the cases in which they could be adapted or applied. They are intended to fuel emerging thinking. Many uncertainties remain regarding these different methodologies, inherent to the current limitations of AI systems in providing a clear and undeniable answer regarding the origin of generated content.

Appendix

For our prototype, we compared two specific techniques:

KD-Tree (K-Dimensional Tree), which enables exact nearest neighbor search by precisely partitioning the embedding space.
The HNSW (Hierarchical Navigable Small Worlds) algorithm, which performs approximate search with a controlled error tolerance parameter to achieve significantly faster results.

For both techniques, we use Euclidean distance as the similarity measure.

Figure 12 : Comparison of index construction time and nearest neighbors search time as a function of the size of the image database with which to compare a generated image, on a logarithmic scale.

Figure 12 illustrates the contrasting performances of the KD-Tree and HNSW methods depending on the size of the image database:

KD-Tree allows for fast indexing (left graph), but its search time increases significantly with the size of the database (right graph), reaching non-negligible values for large databases.
Conversely, HNSW has a higher initial cost for index construction but provides much lower and stable search times, even for a database of 300,000 images. This robustness in the face of increasing size and dimensionality makes HNSW particularly suitable for large-scale applications, such as searching for similar images in collections of artworks. This methodological trade-off between accuracy and speed is detailed in the comparative results in the table below:

‍	Accuracy	Speed	Ease of Implementation.	Scalability and robustness to variations
Nearest Neighbor Search (HNSW)	Approximate search, good accuracy with a slight margin of error	Very fast after indexing, but the construction phase is longer	Requires precise parameter tuning	‍Reliable despite variations and highly effective on large bases and large dimensions
‍Nearest Neighbor Search (KD-Tree)	Exact search, effective on small datasets but less performant in high dimensions	Fast indexing, but search time increases with the size of the dataset	Easy to implement	Less reliable and efficient when the data is large, suitable for small databases

Table 2: Summary Table of Advantages and Disadvantages of the Two Presented Techniques

Evaluating the method by assigning the same number K of images for all generated images is subject to a bias: not all generated images were created using the same number of real artworks. Between 1 and 67 artworks were used to retrain the generation model, depending on the generated image. Thus, generated images inspired by only two real images will necessarily have incorrect attributions if the method is asked to assign five inspirational images to them, not due to a flaw in the method itself.
To eliminate this bias, we assign for each generated image a number of images equal to the number of real images used to retrain the model that generated it. Consequently, in this framework, precision and recall metrics become equivalent. Figure 13 shows the attribution performance. We can also observe that the method performs significantly better on the "gpt" dataset than on the "object" dataset.
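The equivalence of the two metrics in this setting follows directly from their definitions. Writing A for the set of assigned images and R for the set of real images used for a given generated image (notation introduced here for illustration):

```latex
\[
\text{precision} = \frac{|A \cap R|}{|A|}, \qquad
\text{recall} = \frac{|A \cap R|}{|R|}, \qquad
|A| = |R| = K \;\Longrightarrow\;
\text{precision} = \text{recall} = \frac{|A \cap R|}{K}.
\]
```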

Figure 13: Attribution performance when the number K of assigned images equals the number of inspiration images, for each generated image.

To generate embeddings capturing only the content of images, we use the following protocol:

Generate textual descriptions for all images in the "exemplar" folder using the prompt: "Describe the image by naming only the main objects depicted in 1 sentence and 1-10 words. Use as few words as possible. Emphasize the main objects." with the Qwen2-VL-7B-Instruct model;
Compute embeddings for these descriptions using the SigLIP model (text embedder);
For each test image, calculate the average cosine similarity between pairs of embeddings in the set of attributed images and in the set of real images.
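The last step of this protocol can be sketched as follows, assuming the description embeddings have already been computed. The small vectors below are invented stand-ins for SigLIP text embeddings, and the function names are ours.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all pairs in a set of embeddings.
    Requires at least two embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Toy description embeddings for one test image.
attributed_set = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.6]]
real_set = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

attributed_score = mean_pairwise_similarity(attributed_set)
real_score = mean_pairwise_similarity(real_set)
print(attributed_score, real_score)
```

A higher score for the attributed set than for the real set, as in this toy case, corresponds to the pattern reported in Figure 8: attributed images are more homogeneous in content.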