The Smart Gallery is a storage space which allows users to search their data not only by the filenames but also by the contents of their media.
This is the first of a series of blogs where we will be looking at some interesting problems we are working on at Kubric. This will lay the foundation for forthcoming blogs where we will explore each of the topics in greater depth.
At Kubric, we are faced with the challenge of building a Smart Gallery: a storage space that lets users search their data not only by filenames but also by the contents of their media. For example, searching for “black women dress” will return all images of female models wearing black dresses. Apart from keeping data organised and easily searchable, the Smart Gallery supports a variety of media formats for images, audio, video, fonts, etc. (including mov, webm, mp4, avi, mp3, jpg, png, pdf, svg, ttf and otf, along with raw design formats like sketch, psd, ai and aep). It also provides advanced editing capabilities such as stickerization and background removal, to name a few.
The user-uploaded data is a bunch of files with limited textual information about them (an actual example of a filename: SHOT17-1406.jpg). To support searching by content, we enrich the data, which involves extracting relevant details from the uploaded content. The data is unstructured because the extracted information will be novel for every new piece of content the system encounters.
Since our customers belong to different domains such as fashion, food and travel, our problem becomes multi-domain. The challenge, then, is to build “multi-domain search over unstructured data”. To understand how multi-domain search makes the problem even more challenging, consider a user searching for “mango”. In the food domain, mango would mean the fruit, whereas in the fashion domain it would mean the brand “Mango”.
The major components of Smart Gallery are:
- Taxonomy
- Data Enrichment
- Search
- Ingestion Pipelines
- Transformations
We will discuss each of these components and their challenges in the following sections.
Building a taxonomy
Building a taxonomy is the most important step in understanding domain knowledge. This is a prerequisite to data enrichment and it requires domain expertise.
We collected data across different domains by scraping different websites and used it for building taxonomies manually. The fashion domain taxonomy includes brands, categories, subcategories and their correlation, whereas the food domain taxonomy has data around dishes, cuisines, ingredients etc.
It is also important to consider synonyms while creating a taxonomy, so we created a synonym list for the collected data (for example: tees, t-shirt and tee-shirt are synonyms of tshirt) and stored it as part of our taxonomy. The taxonomy is used in data enrichment as well as in search.
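As a rough sketch of how such a synonym list can sit alongside the taxonomy, the snippet below maps every variant spelling to one canonical term. The entries are purely illustrative, not our actual taxonomy:

```python
# Illustrative synonym list: every variant maps to one canonical term.
SYNONYMS = {
    "tees": "tshirt",
    "t-shirt": "tshirt",
    "tee-shirt": "tshirt",
    "pumps": "heels",
    "stilettos": "heels",
}

def canonicalize(term: str) -> str:
    """Resolve a term to its canonical form; unknown terms pass through."""
    return SYNONYMS.get(term.lower(), term.lower())
```

Storing terms in canonical form keeps enrichment and search consistent: whichever variant a user types, the same taxonomy node is hit.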
Rich Data leads to Rich Results
Data enrichment is a crucial step, since the user provides very limited information along with the uploaded content.
We use image/video libraries to extract media metadata. Along with this, we trained domain-specific ML models to extract richer, more meaningful information about the content.
The extracted information ranges from media metadata such as “aspect ratio”, “size”, “file types” etc., to much richer information, such as what “kind of dress” a “model” is wearing, what are the “dominant colours” in the image, whether it is a “male/female” model, if an image is a “logo image” or not, what “dish” it is, what are the major “ingredients” visible in the dish, etc.
Data enrichment is a continuous process. The more information you extract, the better the search results will be. We constantly strive to improve the ML models to better understand user content and thereby extract richer metadata.
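To make one of these extractions concrete, here is a minimal, dependency-free sketch of how “dominant colours” could be computed from an image's pixels. It is a stand-in for the richer ML-based enrichment described above, and the bucketing scheme is an assumption of this sketch:

```python
from collections import Counter

def dominant_colours(pixels, top=3):
    """Return the `top` most frequent coarse colours from a list of RGB tuples.

    Each channel is snapped to a 64-wide bucket so near-identical shades
    count as one colour instead of thousands of distinct values.
    """
    def bucket(c):
        return (c // 64) * 64

    quantized = [(bucket(r), bucket(g), bucket(b)) for r, g, b in pixels]
    return [colour for colour, _ in Counter(quantized).most_common(top)]
```

In practice the pixel list would come from an image library such as PIL, and the resulting colours would be indexed alongside the other extracted metadata.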
What I search is what I (don’t) see
Even with the richest of content, search will fail to give good results if the query is ill-formed. Searching for the right content involves four main steps: Spell Correction, Query Interpretation, Query Expansion and Query Reformulation.
Human errors are unavoidable, and that is where spell correction comes to the rescue. We use spell correction along with query interpretation to identify brands, categories and other topics of interest. This step identifies the “meaning” of each term in the user query with the help of the taxonomy created for different domains. For example, if a user types “steve madden red heels”, we interpret the query as: “colour: red”, “subcategory: heels”, “brand: steve madden”.
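The two steps can be sketched together as follows. The facet sets here are hypothetical stand-ins for a real domain taxonomy, and `difflib` is used as a simple substitute for a production spell corrector:

```python
import difflib

# Hypothetical facet sets; real taxonomies are built per domain.
TAXONOMY = {
    "brand": {"steve madden", "mango"},
    "subcategory": {"heels", "dress"},
    "colour": {"red", "black"},
}

# Every individual word known to the taxonomy, used for spell correction.
VOCABULARY = sorted({w for terms in TAXONOMY.values() for t in terms for w in t.split()})

def spell_correct(token: str) -> str:
    """Snap a misspelled token to the closest known word, if any is close enough."""
    match = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.6)
    return match[0] if match else token

def interpret(query: str) -> dict:
    """Tag each term (or multi-word phrase) in the query with its taxonomy facet."""
    tokens = [spell_correct(t) for t in query.lower().split()]
    interpretation, i = {}, 0
    while i < len(tokens):
        matched = False
        # Try the longest phrase first so "steve madden" wins over "madden".
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            for facet, terms in TAXONOMY.items():
                if phrase in terms:
                    interpretation[facet] = phrase
                    i, matched = j, True
                    break
            if matched:
                break
        if not matched:
            i += 1
    return interpretation
```

Longest-phrase-first matching matters for multi-word brands: without it, “steve madden” would never be recognised as a single brand term.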
The next step is Query Expansion, which adds related words to a query to increase the number of returned documents and improve recall. Query expansion is done using the synonym list. Consider the same example: “steve madden red heels”. The colour red is expanded into different types of “red”, such as “dark red”, “firebrick” and “indian red”, and “heels” gets expanded to “pumps”, “stilettos”, “gladiators”, etc.
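A minimal sketch of this step, with an illustrative expansion table (in our system these variants come from the taxonomy and synonym list, not a hard-coded dict):

```python
# Illustrative expansion table keyed by canonical term.
EXPANSIONS = {
    "red": ["dark red", "firebrick", "indian red"],
    "heels": ["pumps", "stilettos", "gladiators"],
}

def expand(terms):
    """Return each term together with its related variants."""
    return {t: [t] + EXPANSIONS.get(t, []) for t in terms}
```

Terms with no known variants (such as a brand name) simply pass through unchanged.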
The query reformulation step decides the importance of each of the identified terms. The rules for query reformulation are configured at the domain level. For instance, in the fashion domain, we identified that brand was more important than subcategory which in turn was more important than colour. The query is rewritten assigning weights to the terms from query expansion and the final reformulated query is then fired to the search engine.
We had to go through multiple iterations of query reformulation to get the best results for our users.
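One way to picture the reformulation step is as emitting a Lucene-style boosted query, where each facet carries a configured weight. The weights and query syntax below are a sketch under that assumption, not our exact configuration:

```python
# Per-domain facet weights; in the fashion domain, brand > subcategory > colour.
FACET_WEIGHTS = {"brand": 3.0, "subcategory": 2.0, "colour": 1.0}

def reformulate(interpreted: dict) -> str:
    """Build a Lucene-style boosted query string from interpreted facets."""
    clauses = []
    for facet, term in interpreted.items():
        weight = FACET_WEIGHTS.get(facet, 1.0)
        clauses.append(f'{facet}:"{term}"^{weight}')
    return " ".join(clauses)
```

The final reformulated query is what gets fired at the search engine; tuning those weights is where the iterations mentioned above went.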
Ingestion Pipelines

Along with browser uploads, the Smart Gallery supports syncing data from different cloud storage providers such as AWS S3, Azure Storage, Drive, Dropbox, Cloudinary, etc. The challenge here is to build real-time ingestion pipelines to handle continuous syncing of large volumes of data.
We set up an event-based pipeline to support continuous sync from different data sources into our Smart Gallery. Since some cloud providers do not support update notifications, we use BFS traversals over their folder trees to sync data incrementally.
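The BFS-based incremental sync can be sketched as below. `list_children` and `modified_after` stand in for provider-specific API calls, and the dict-based tree is a hypothetical representation for illustration:

```python
from collections import deque

def incremental_sync(root, list_children, modified_after, since):
    """Breadth-first walk of a provider's folder tree, collecting files
    modified after `since`. Used for providers without change notifications.
    """
    changed, queue = [], deque([root])
    while queue:
        folder = queue.popleft()
        for child in list_children(folder):
            if child.get("folder"):
                queue.append(child)       # descend into subfolders level by level
            elif modified_after(child, since):
                changed.append(child["name"])
    return changed
```

Running this periodically with the timestamp of the last successful sync gives incremental behaviour on top of a plain listing API.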
Ensuring reliability of ingestion pipelines is very important as any downtime has direct user impact.
Transformations

The Smart Gallery provides URL-based transformations backed by a CDN (Content Delivery Network). The transformations include “resize”, “rotate”, “gradient”, “filters”, etc., along with conversion to different file formats.
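As an illustration of what “URL-based” means here, a transform segment of the URL can encode the operations. The `key_value` scheme below (w for width, h for height, r for rotate) is a hypothetical example, not our actual URL format:

```python
def parse_transforms(path_segment: str) -> dict:
    """Parse a comma-separated transform segment like 'w_300,h_200,r_90'.

    Keys and values are illustrative: w=width, h=height, r=rotate degrees.
    Numeric values are converted to int; others (e.g. a format name) stay strings.
    """
    ops = {}
    for part in path_segment.split(","):
        key, _, value = part.partition("_")
        ops[key] = int(value) if value.isdigit() else value
    return ops
```

The parsed operations are then applied to the asset, and the CDN caches the transformed result under that URL.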
Supporting different file formats was a major challenge in transformations. We use different image libraries for different file formats: “WandImage” for “psd”, “ai” and “eps”, “Rawkit” for “cr2”, and “PIL” for “jpeg”, “png”, etc.
The transformations have to be fast and extensive. Since the media previews in Smart Gallery are rendered using the transformed urls, the faster the transformations, the faster the asset will be available for use.
We will dive deeper into the technical details of each of these components in the upcoming blogs. Stay tuned for:
- Speedier Uploads
- Challenges in Query Reformulation
- Building Ingestion Pipelines
- Building Enrichment Pipelines
- Converting Unstructured data to Structured data
- Universal Search